Abstract

The problem of recovering (count and sum) range queries over multidimensional 
data only on the basis of aggregate information on such data is addressed. This 
problem can be formalized as follows. Suppose that a transformation $\tau$ producing
a summary from a multidimensional data set is used. Now, given a data set D, a
summary $S = \tau(D)$ and a range query r on D, the problem consists of studying r
by modelling it as a random variable defined over the sample space of all the data
sets D' such that $\tau(D') = S$. The study of such a random variable, done by the
definition of its probability distribution and the computation of its mean value and 
variance, represents a well-founded, theoretical probabilistic approach for estimating 
the query only on the basis of the available information (that is the summary S) 
without assumptions on original data. 



1 Introduction 



In many application contexts, such as statistical databases, transaction record- 
ing systems, scientific databases, query optimizers, OLAP (On-line Analyti- 
cal Processing), and many others, a multidimensional view of data is often 
adopted: Data are stored in multidimensional arrays, called datacubes [19,24],
where every range query (computing aggregate values such as the sum of the 
values contained inside a range, or the number of occurrences of distinct val- 
ues) can be answered by visiting sequentially a sub-array covering the range. 
In demanding applications, in order to both save storage space and support 
fast access, datacubes are summarized into lossy synopses of aggregate val- 
ues, and range queries are executed over aggregate data rather than over raw 
ones, thus returning approximate answers. Approximate query answering is 
very useful when the user wants to have fast answers, thus avoiding waiting a 
long time to get a precision which is often not necessary. 
Data aggregation and approximate answering have been first introduced many 
years ago for histograms [29] in the context of selectivity estimation (i.e. es- 
timation of query result sizes) for query optimization in relational databases 
[8,10,32,34]. In this scenario, histograms are built on the frequency distribution 
of attribute values occurring in a relation, and are constructed by partitioning 
this distribution into a number of non-overlapping blocks (called buckets). For 
each of these blocks, a number of aggregate data are stored, instead of the 
detailed frequency distribution. The selectivity of a query is estimated by
interpolating the aggregate information stored in the histogram.
Later on, several techniques for compressing datacubes and allowing fast ap- 
proximate answering have been proposed in the literature in the context of 
OLAP applications, where data to be summarized are called measure values 
(e.g., daily income of a shop, number of users accessing a service, etc.). Some 
of these approaches use either sampling [15,16,23,25] or complex mathematical
transformations (such as wavelets) to compress data [14,36,38,39]. Indeed,
the approach which turned out to be the most effective one (in terms of ac- 
curacy of the estimates) is the histogram-based one. In fact, both frequency 
distributions occurring in selectivity estimation and measure values in OLAP 
datacubes are multi-dimensional data distributions, which can be partitioned 
and aggregated adopting the same technique. Therefore, due to the increasing 
popularity of OLAP applications (which turned out to be particularly useful 
for the decision making process [7]), a renewed interest has been devoted to 
histogram-based compression techniques. Most works on this topic mainly
deal with either improving partitioning techniques in terms of efficiency and 
effectiveness [5,18,22,27], or maintaining the summary data up-to-date when 
the collected information changes continuously [11,20,21,37]. 

In this paper we address a different problem, which has been rather disregarded 
by previous research, and which is very relevant for an effective applicability 
of summarization techniques: We focus on the analysis of estimation errors 
which occur when evaluating range queries directly on summary data, without 
accessing original ones. Indeed, in all previous works dealing with histogram- 
based summarization techniques, either the estimation error is not studied at 
all, or only a rough evaluation of upper bounds of this error is given [28]. The
lack of information on the estimation error reduces the scope of applicability
of approximate query answering: approximate results are really usable only if
they are returned together with a detailed analysis of the possible error so that, 
if the user is not satisfied with the obtained precision, s/he may eventually 
decide to submit the query on the actual datacube. 

In more detail, we study the problem of estimating count and sum range 
queries issued on a compressed datacube in a rather general framework: we 
assume that compression has been performed by first partitioning a given dat- 
acube into a number of blocks using any of the various proposed techniques, 
and then storing aggregate information for each block. This aggregate infor- 
mation mainly consists in the sum and the number of the elements belonging 
to each block. Moreover, we assume that some integrity constraints, which are 
expressible in a succinct format, are stored. Our approach is independent of
the technique used to partition the datacube into blocks: its concern is estimating
values and accuracy of range queries using aggregate data - no matter
how they have been obtained - using just interpolation with no assumptions 
on the actual distribution of data into aggregation blocks. 

The evaluation of the accuracy of estimates is based on a probabilistic frame- 
work where queries are represented by means of random variables, and is 
performed as follows. Given a datacube D and the compressed datacube S 
obtained from D by applying the histogram-based compression strategy introduced
above, we denote the transformation from D to S by $\tau$, thus $S = \tau(D)$.
Let now $\mathcal{D}_S$ denote the set of all the datacubes D' such that $\tau(D') = S$. Observe
that any datacube $D' \in \mathcal{D}_S$ is a possible guess of the original datacube
D, done only on the basis of the knowledge of S. So, if we are given a range
query r on D, estimating r from S can be thought of as guessing the response
to r on D by applying the range query r to any datacube D' of $\mathcal{D}_S$. According
to this observation, we model the estimation of the range query r from S as
the mean value of the random variable defined by r on the sample set $\mathcal{D}_S$,
representing all possible guesses of r compatible with the summary S. In order
to analyze the estimation error, we thus study this random variable by determining
its probability distribution and its variance. Actually, our analysis
considers a family of transformations $\tau$ based on the partition of the datacube
into blocks, where each transformation stores different aggregate information 
for each block and a number of integrity constraints. The introduction of in- 
tegrity constraints allows us to take into account more detailed information 
than sum and count on a block, whose exploitation may bias significantly the 
estimation toward the actual value. Indeed, integrity constraints produce a re- 
striction of the sample space and a reduction of the variance of the estimation 
w.r.t. the case of absence of integrity constraints. The integrity constraints
which have been considered in this work concern the minimum number of null 
or non-null tuples occurring in ranges of datacubes. Although more complex 
constraints could be considered, we have restricted our attention to this kind 
of constraint since they often arise in practice. For instance, given a datacube 
whose dimensions are the time (in terms of days) and the products, and the
measure is the amount of daily product sales, realistic integrity constraints 
are that the sales are null during the week-end, while at least 4 times a week 
the sales are not null. 

Plan of the paper. The paper is organized as follows. In Section 2, a simple 
compression technique which will be used for explaining and applying our esti- 
mation paradigm is introduced, and integrity constraints about the number of 
null or non-null tuples in the datacube ranges are formally defined. In Section 
3 the probabilistic framework for estimating count and sum range queries is 
formalized. In Sections 4, 5 and 6 three different estimation paradigms (ex- 
ploiting different classes of aggregate information) are introduced: in Section 
4, sum queries are estimated by using only the information about the sum of 
the values contained in each bucket, whereas count queries are evaluated by 
exploiting only the information on the number of non null elements in each 
bucket. Section 5 shows how the information on the number and the sum of 
the non null values contained in each bucket can be used jointly to estimate 
sum and count queries. In Section 6 the estimation using integrity constraints 
is formalized and in the subsequent section we elaborate on the "positive" 
influence of integrity constraints on the accuracy of query estimations, and 
substantiate our claim with some experimental results obtained applying our 
estimation techniques to real-life data distributions. Some interesting applica- 
tions of our theoretical framework for the estimation of frequency distributions 
inside a histogram are presented in Section 8. Finally, in Section 9 conclusions 
and future research lines are discussed. In particular, we stress that our work 
is certainly not conclusive, since a larger family of transformations can be
considered, by taking into account different aggregates and other integrity 
constraints. Indeed the main contribution of the work is the definition of a 
novel approach for modelling and studying the issue of approximate queries 
from a theoretic point of view. 



2 Datacubes, their Compressed Representation and Integrity Con- 
straints 

2.1 Preliminary Definitions 

In this section we give some preliminary definitions and notations. Let
$i = \langle i_1, \ldots, i_r \rangle$ and $j = \langle j_1, \ldots, j_r \rangle$ be two r-tuples of cardinals, with r > 0.
We extend common operators for cardinals to tuples in the obvious way: $i \le j$
means that $i_1 \le j_1, \ldots, i_r \le j_r$; $i + j$ denotes the tuple $\langle i_1 + j_1, \ldots, i_r + j_r \rangle$, and
so on. Given a constant p > 0, $\mathbf{p}^r$ (or simply $\mathbf{p}$, if r is understood) denotes
the r-tuple of all p; for instance, if p = 1 and r = 5, the term $\mathbf{1}$ denotes the
tuple <1, 1, 1, 1, 1>. Finally, $[i..j] = [i_1..j_1, \ldots, i_r..j_r]$ denotes the range of all
r-tuples from i to j, that is $\{q \mid i \le q \le j\}$.

Definition 1 A multidimensional relation R is a relation whose scheme consists
of r > 0 dimensions (also called functional attributes) and s > 0 measure
attributes. The dimensions are a key for the relation so that there are no two
tuples with the same dimension value.

From now on consider given a multidimensional relation R. For the sake of
presentation but without loss of generality, we assume that: 

• s = 1, and the domain of the unique measure attribute is the set of cardinals, 

• $r \ge 1$, and the domain of each dimension q, with $1 \le q \le r$, is the range
$[1..n_q]$, where $n_q > 2$ (i.e., the projection of R on the dimensions is a subset
of [1..n], where $n = \langle n_1, \ldots, n_r \rangle$).

Given any range [i..j], $1 \le i \le j \le n$, we consider the following range queries
on R:

• count query: $count^{[i..j]}(R)$ denotes the number of tuples of R whose dimension
values are in [i..j];

• sum query: $sum^{[i..j]}(R)$ denotes the sum of all measure values for those tuples
of R whose dimension values are in [i..j].

Since the dimension attributes are a key, the relation R can be naturally 
viewed as a [1..n] matrix M of elements with values in $\mathcal{N}$. In the rest of the
paper this matrix will be called datacube. 

Definition 2 The datacube M corresponding to the multidimensional relation
R is the [1..n] matrix of cardinals such that, for each $i \in [1..n]$, M[i] = v
if the tuple <i, v> is in R or M[i] = 0, otherwise.

As a consequence of the above definition, i is a null element if either <i, 0> is 
in R or no tuple with dimension value i is present in R. 
The above range queries can be now re-formulated in terms of array operations 
as follows: 

• $count^{[i..j]}(R) = count(M[i..j]) = |\{q \mid q \in [i..j] \text{ and } M[q] > 0\}|$;

• $sum^{[i..j]}(R) = sum(M[i..j]) = \sum_{q \in [i..j]} M[q]$,

where [i..j] is any range such that $1 \le i \le j \le n$.
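
The two array formulations above can be illustrated with a minimal Python sketch. The sparse dictionary encoding of M, the helper names `cells`, `count_query` and `sum_query`, and the sample values are assumptions made only for this illustration; they do not come from the paper.

```python
from itertools import product

def cells(i, j):
    """Yield every index tuple q with i <= q <= j (componentwise, inclusive)."""
    return product(*(range(lo, hi + 1) for lo, hi in zip(i, j)))

def count_query(M, i, j):
    """count(M[i..j]): number of non-null (> 0) cells inside the range."""
    return sum(1 for q in cells(i, j) if M.get(q, 0) > 0)

def sum_query(M, i, j):
    """sum(M[i..j]): sum of the measure values inside the range."""
    return sum(M.get(q, 0) for q in cells(i, j))

# Hypothetical 3 x 3 datacube stored sparsely: absent cells are null (0),
# and an explicit 0 is null as well, as in Definition 2.
M = {(1, 1): 5, (1, 3): 2, (2, 2): 7, (3, 1): 0, (3, 3): 4}

print(count_query(M, (1, 1), (2, 3)))  # 3
print(sum_query(M, (1, 1), (2, 3)))    # 14
```

Both functions visit every cell of the range, which is exactly the cost that the compressed representation introduced next is meant to avoid.
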
2.2 Compressed Datacubes 

We next introduce a compressed representation of the relation R by dividing
the corresponding datacube M into a number of blocks and by storing a
number of aggregate data for each of them.

Fig. 1. A two-dimensional datacube and its compressed representation

First we need to formalize the notion of compression factor:



Definition 3 Given $m = \langle m_1, \ldots, m_r \rangle$, $1 \le m \le n$, an m-compression
factor for M is any tuple $F = \langle f_1, \ldots, f_r \rangle$, such that for each dimension q,
$1 \le q \le r$, $f_q$ is a $[0..m_q]$ array for which $0 = f_q[0] < f_q[1] < \cdots < f_q[m_q] = n_q$,
i.e., $f_q$ divides the dimension q into $m_q$ parts.

Observe that F partitions the range [1..n] into $m_1 \times \cdots \times m_r$ blocks. Each
of these blocks, denoted as $B_k$, corresponds to a tuple $k = \langle k_1, \ldots, k_r \rangle$ in
[1..m]. Each block has range $[F^-(k)..F^+(k)]$, where $F^+(k)$ and $F^-(k)$
denote the tuples $\langle f_1[k_1], \ldots, f_r[k_r] \rangle$ and $\langle f_1[k_1 - 1] + 1, \ldots, f_r[k_r - 1] + 1 \rangle$,
respectively. The size of $B_k$ (i.e. the number of cells inside the range of $B_k$) is

$(f_1[k_1] - f_1[k_1 - 1]) \times \cdots \times (f_r[k_r] - f_r[k_r - 1])$.

As an example, consider the [1..10, 1..6] datacube M in Figure 1(a), which is
partitioned into 6 blocks as shown in Figure 1(b). We have that $m = \langle 3, 2 \rangle$,
$f_1[0] = 0$, $f_1[1] = 3$, $f_1[2] = 7$, $f_1[3] = 10$, and $f_2[0] = 0$, $f_2[1] = 4$, $f_2[2] = 6$.
The block $B_{\langle 1,1 \rangle}$ has size 3 x 4 and range [1..3, 1..4]; the block $B_{\langle 1,2 \rangle}$ has
size 3 x 2 and range [1..3, 5..6], and so on.

Definition 4 Given an m-compression factor F, a (F-)compressed representation
of the datacube M is the pair of [1..m] matrices $M_{count,F}$ and $M_{sum,F}$
such that for each $k \in [1..m]$, $M_{count,F}[k] = count(M[F^-(k)..F^+(k)])$ and
$M_{sum,F}[k] = sum(M[F^-(k)..F^+(k)])$.

The compressed representation of the datacube M in Figure 1(a) is represented 
in Figure 1(c), where each block is associated to a triplet of values. These 
values indicate, respectively, the range, the number of non-null elements and 
the sum of the elements in the corresponding block. For instance, the block 
$B_{\langle 1,1 \rangle}$ has range [1..3, 1..4] and contains 8 non-null elements with sum 26;
the block $B_{\langle 1,2 \rangle}$ has range [1..3, 5..6] and contains 5 non-null elements with
sum 29, and so on. 
From now on, consider given an m-compression factor F and the corresponding
F-compressed representation of the datacube M. 
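
As a rough illustration of Definitions 3 and 4, the following Python sketch derives the block ranges and the two aggregate matrices from a compression factor. The compression factor is the one of the running example; the cell values, the sparse dictionary encoding and the function names `block_range` and `compress` are hypothetical, since the actual contents of Figure 1 are not reproduced here.

```python
from itertools import product

def block_range(F, k):
    """Return (F^-(k), F^+(k)) for the block index k (1-based) under F."""
    lo = tuple(f[kq - 1] + 1 for f, kq in zip(F, k))
    hi = tuple(f[kq] for f, kq in zip(F, k))
    return lo, hi

def compress(M, F):
    """Build M_count,F and M_sum,F as dictionaries keyed by block index k."""
    m = tuple(len(f) - 1 for f in F)  # number of parts per dimension
    m_count, m_sum = {}, {}
    for k in product(*(range(1, mq + 1) for mq in m)):
        lo, hi = block_range(F, k)
        block_cells = product(*(range(a, b + 1) for a, b in zip(lo, hi)))
        values = [M.get(q, 0) for q in block_cells]
        m_count[k] = sum(1 for v in values if v > 0)
        m_sum[k] = sum(values)
    return m_count, m_sum

# Compression factor of the running example: f_1 = [0, 3, 7, 10], f_2 = [0, 4, 6].
F = ([0, 3, 7, 10], [0, 4, 6])
# Hypothetical sparse data (NOT the values of Figure 1).
M = {(1, 1): 2, (2, 4): 3, (3, 5): 1, (5, 2): 4, (10, 6): 6}

m_count, m_sum = compress(M, F)
print(block_range(F, (1, 1)))          # ((1, 1), (3, 4))
print(m_count[(1, 1)], m_sum[(1, 1)])  # 2 5
```

Only the two small dictionaries need to be kept once compression is done; the original cells are not consulted again when a query is estimated.
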



2.3 Integrity Constraints 

The aim of compressing a datacube is to reduce the storage space consumption
of its representation, in order to make answering range queries more efficient
to perform. In fact queries can be evaluated on the basis of the aggregate data 
stored in the compressed datacube without accessing the original one, and 
the amount of data that must be extracted from the compressed datacube to 
answer a query is generally smaller than the number of data that should be 
extracted from the original datacube. This approach introduces some approxi- 
mation, which is tolerated in all those scenarios (such as selectivity estimation 
and OLAP services) where the efficiency of query answering is mandatory and 
the accuracy of the answers is not so relevant. 

The estimation of queries could be improved (in terms of accuracy) if further 
information on the original data distribution inside the datacube is available. 
Obviously, this additional information should be easy to be exploited so that 
the efficiency of the estimation is not compromised. 

In this section we introduce a class of integrity constraints which match these 
properties: they can be stored in a succinct form (thus they can be accessed 
efficiently), and provide some additional information (other than the aggre- 
gate data stored in the compressed datacube) which can be used in query 
answering, as will be explained in the following sections. 

Let $2^{[1..n]}$ be the family of all subsets of indices in [1..n]. We analyze two types
of integrity constraint:

• number of elements that are known to be null: we are given a function
$LB_{=0} : 2^{[1..n]} \to \mathcal{N}$ returning, for any D in $2^{[1..n]}$, a lower bound to the number of
null elements occurring in D; the datacube M satisfies $LB_{=0}$ if, for each D
in $2^{[1..n]}$, $\sum_{i \in D} count(M[i]) \le |D| - LB_{=0}(D)$, where |D| is the number of
elements of M in D;

• number of elements that are known to be non-null: we are given a function
$LB_{>0} : 2^{[1..n]} \to \mathcal{N}$ returning, for any D in $2^{[1..n]}$, a lower bound for the
number of non-null elements occurring in D; the datacube M satisfies $LB_{>0}$
if, for each D in $2^{[1..n]}$, $\sum_{i \in D} count(M[i]) \ge LB_{>0}(D)$.

The two functions $LB_{=0}$ and $LB_{>0}$ are monotonic: for each D', D'' in $2^{[1..n]}$,
if $D' \subset D''$ then both $LB_{=0}(D') \le LB_{=0}(D'')$ and $LB_{>0}(D') \le LB_{>0}(D'')$
hold. From now on, consider given the above two functions together with the
compressed representation of M.

We point out that the integrity constraints expressed by $LB_{=0}$ and $LB_{>0}$ often
occur in practice. For instance, consider the case of a temporal dimension with
granularity day and a measure attribute storing the amount of sales for every
day. Given any temporal range, we can easily recognize a number of certain null 
values, corresponding to the holidays occurring in that range. In similar cases, 
the constraints provide additional information that can be efficiently computed 
with no overhead in terms of storage space on the compressed representation 
of M. 

As an example of how $LB_{=0}$ and $LB_{>0}$ influence the estimation of range
queries, consider the following case. Suppose that $LB_{=0}([4..6, 1..3]) = 3$ and
$LB_{>0}([4..6, 1..3]) = 1$ for the two-dimensional datacube of Figure 1. From this,
we can infer that the number of non-null elements in the range [4..6, 1..3] is
between 1 and $(6 - 4 + 1) \times (3 - 1 + 1) - 3 = 6$. Note that the compressed
representation of M in Figure 1(c) only contains the information that the
block [4..7, 1..4] has 7 non-nulls; so, without the knowledge about the above
constraints, we could only derive that the bounds on the number of non-null
elements in [4..6, 1..3] are 0 and 7.
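
The arithmetic of this example can be reproduced with a small Python sketch. The function `nonnull_bounds` and its parameter names are assumptions introduced for illustration; in particular, intersecting the bounds obtained from the block aggregate with those obtained from the two constraints is one plausible way to combine the information, not a procedure stated in the text.

```python
def nonnull_bounds(b, t, b_range, lb_null=0, lb_nonnull=0):
    """Bounds on the number of non-null cells in a range of size b_range
    lying inside a block of size b that holds t non-null cells.

    lb_null and lb_nonnull play the role of LB_{=0} and LB_{>0} evaluated
    on the range (0 means that no constraint is known).
    """
    # Block aggregate alone: at most b - b_range non-nulls fit outside the range.
    lo = max(0, t - (b - b_range))
    hi = min(t, b_range)
    # Constraints on the range itself tighten both ends.
    lo = max(lo, lb_nonnull)
    hi = min(hi, b_range - lb_null)
    return lo, hi

# Block [4..7, 1..4]: b = 16 cells, t = 7 non-nulls; range [4..6, 1..3]: 9 cells.
print(nonnull_bounds(16, 7, 9))                           # (0, 7)
print(nonnull_bounds(16, 7, 9, lb_null=3, lb_nonnull=1))  # (1, 6)
```

The interval shrinks from [0, 7] to [1, 6], which is the restriction of the sample space that the constraints are meant to provide.
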



3 The Probabilistic Framework for Range Query Estimation 

We next introduce a probabilistic framework for estimating the answers of 
range queries (sum and count) by consulting aggregate data rather than the
actual datacube. To this aim, we view queries as random variables and we give 
their estimation in terms of mean and variance. 

A range query Q on the datacube M is modelled as a random variable $\tilde{Q}$
defined by applying Q on a datacube $\tilde{M}$ extracted from a datacube population
compatible with M, thus consisting of datacubes whose F-compressed
representations coincide (at least partially) with that of M.

More precisely, we have different random variables modelling Q, depending
on what exactly we mean by 'compatible', and thus on the datacube population
on which the query is applied. In particular, we consider the following
populations:

• $M^{-1}_{c,F}$ is the set of all the [1..n] matrices M' of elements in $\mathcal{N}$ for which
$M'_{count,F} = M_{count,F}$;

• $M^{-1}_{s,F}$ is the set of all the [1..n] matrices M' of elements in $\mathcal{N}$ for which
$M'_{sum,F} = M_{sum,F}$;

• $M^{-1}_{cs,F}$ is the set of all the [1..n] matrices M' of elements in $\mathcal{N}$ for which
$M'_{count,F} = M_{count,F}$ and $M'_{sum,F} = M_{sum,F}$;

• $\sigma_{LB_{=0},LB_{>0}}(M^{-1}_{cs,F}) = \{M' \mid M' \in M^{-1}_{cs,F} \wedge M' \text{ satisfies both } LB_{=0} \text{ and } LB_{>0}\}$ is
the sub-population of $M^{-1}_{cs,F}$ whose members also satisfy the integrity constraints.
On the whole, given a range [i..j], $1 \le i \le j \le n$, of size $b_{i..j}$ (i.e., number of
elements occurring in it) we study the following six random variables,
grouped into three cases:

Case 1: For the estimation of count(M[i..j]) we consider the population of all
datacubes having the same number of non-nulls in each block as M, and for
that of sum(M[i..j]) the population of all datacubes whose blocks have the
same sum as the corresponding blocks in M. Thus, we study the following
two random variables:

- The random variable $C_1(b_{i..j})$, computing $count(\tilde{M}[i..j])$, where $\tilde{M}$ is
extracted from the population $M^{-1}_{c,F}$.

- The random variable $S_1(b_{i..j})$, computing $sum(\tilde{M}[i..j])$, where $\tilde{M}$ is
extracted from the population $M^{-1}_{s,F}$.

Note that, as will be clear in the following, both the random variables above
are functions only of the range size $b_{i..j}$, and not of its boundaries
i and j.

Case 2: We estimate the number and the sum of the non-null elements in
M[i..j] by considering the population of all the datacubes whose blocks have
both the same sum and the same number of non-nulls as the corresponding
blocks in M. Then, the random variables are:

- The random variable $C_2(b_{i..j})$, computing $count(\tilde{M}[i..j])$, where $\tilde{M}$ is
extracted from the population $M^{-1}_{cs,F}$.

- The random variable $S_2(b_{i..j})$, computing $sum(\tilde{M}[i..j])$, where $\tilde{M}$ is
extracted from the population $M^{-1}_{cs,F}$.

Again, $C_2$ and $S_2$ depend only on the size of the range and not on the range
itself.

Case 3: We consider the population of all datacubes having both the same
sum and the same number of non-nulls in each block as M, and, besides,
satisfying the lower bound constraints on the number of null and non-null
elements occurring in each range. Thus, we study the following two random
variables:

- The random variable $C_3([i..j])$, computing $count(\tilde{M}[i..j])$, where $\tilde{M}$ is
extracted from the population $\sigma_{LB_{=0},LB_{>0}}(M^{-1}_{cs,F})$.

- The random variable $S_3([i..j])$, computing $sum(\tilde{M}[i..j])$, where $\tilde{M}$ is
extracted from the population $\sigma_{LB_{=0},LB_{>0}}(M^{-1}_{cs,F})$.

In this case, differently from the previous ones, the examined random
variables are functions of the range [i..j] (not only of its size), as the values
returned by $LB_{=0}$ and $LB_{>0}$ depend on the considered range.

We observe that Case 2 can be derived from the more general Case 3 but, for 
the sake of presentation, we first present the simpler case, and then we move 
to the general one. Actually the results of Case 2 will be stated as corollaries 
of the corresponding ones of Case 3 and their proofs will be postponed to the
Appendix.
For each random variable above, say $cs(\tilde{M}[i..j])$ (where cs stands for count
or sum), we have to determine its probability distribution and then its mean
and variance. Concerning the mean $E(cs(\tilde{M}[i..j]))$, due to the linearity of E,
we have:

$$E(cs(\tilde{M}[i..j])) = \sum_{B_q \in TB_F(i..j)} M_{cs,F}[q] + \sum_{B_k \in PB_F(i..j)} E(cs(\tilde{M}[i_k..j_k]))$$

where:

(1) $TB_F(i..j)$ returns the set of blocks that are totally contained in the
range [i..j], i.e. every block $B_q$ such that both $i \le F^-(q)$ and $F^+(q) \le j$,

(2) $PB_F(i..j)$ returns the set of blocks that are partially inside the range,
i.e. $B_k \notin TB_F(i..j)$ and either $i \le F^-(k) \le j$ or $i \le F^+(k) \le j$, and

(3) for each $B_k \in PB_F(i..j)$, $i_k$ and $j_k$ are the boundaries of the portion
of the block $B_k$ which overlaps the range [i..j], i.e., $[i_k..j_k] = [i..j] \cap
[F^-(k)..F^+(k)]$.

For instance, consider the datacube in Figure 1(a) and the range [i..j] whose
boundaries are i = <4, 3> and j = <8, 6>. Then the block $B_{\langle 2,2 \rangle}$ is totally
contained in [i..j], the blocks $B_{\langle 2,1 \rangle}$, $B_{\langle 3,1 \rangle}$, $B_{\langle 3,2 \rangle}$ are partially
contained in [i..j], whereas the blocks $B_{\langle 1,1 \rangle}$, $B_{\langle 1,2 \rangle}$ are outside [i..j].
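
The classification used in this example can be sketched in Python as follows. The function names are hypothetical, and the sketch classifies each block by intersecting it with the query range dimension by dimension, which is an assumption about how the sets of totally and partially contained blocks would be computed in practice rather than a literal transcription of conditions (1) and (2).

```python
from itertools import product

def block_range(F, k):
    """Return (F^-(k), F^+(k)) for the block index k (1-based) under F."""
    lo = tuple(f[kq - 1] + 1 for f, kq in zip(F, k))
    hi = tuple(f[kq] for f, kq in zip(F, k))
    return lo, hi

def classify_blocks(F, i, j):
    """Split block indices into totally contained, partially overlapped
    (with the overlapping sub-range [i_k..j_k]) and outside the range [i..j]."""
    total, partial, outside = [], [], []
    m = tuple(len(f) - 1 for f in F)
    for k in product(*(range(1, mq + 1) for mq in m)):
        lo, hi = block_range(F, k)
        i_k = tuple(max(a, c) for a, c in zip(i, lo))
        j_k = tuple(min(b, d) for b, d in zip(j, hi))
        if any(a > b for a, b in zip(i_k, j_k)):
            outside.append(k)           # empty intersection
        elif i_k == lo and j_k == hi:
            total.append(k)             # the whole block lies in the range
        else:
            partial.append((k, i_k, j_k))
    return total, partial, outside

F = ([0, 3, 7, 10], [0, 4, 6])
total, partial, outside = classify_blocks(F, (4, 3), (8, 6))
print(total)                       # [(2, 2)]
print([k for k, _, _ in partial])  # [(2, 1), (3, 1), (3, 2)]
print(outside)                     # [(1, 1), (1, 2)]
```

Only the blocks in `partial` contribute uncertainty to the estimate; the blocks in `total` contribute their stored aggregates exactly.
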

Concerning the variance, we assume statistical independence between the
measure values of different blocks, so that its value is determined by summing the
variances of all the partially overlapped blocks, thus introducing no covariance:

$$\sigma^2(cs(\tilde{M}[i..j])) = \sum_{B_k \in PB_F(i..j)} \sigma^2(cs(\tilde{M}[i_k..j_k]))$$

It turns out that we only need to study the estimation of a query inside one 
block, as all other cases can be easily re-composed from this basic case: the 
estimate of a query involving more than one block is the sum of the estimates 
for each of the blocks involved, and the same holds for the variance. 

Therefore, from now on we assume that the query range [i..j] is strictly inside
one single block, say the block $B_k$, i.e. $F^-(k) \le i \le j \le F^+(k)$. We use the
following notations and assumptions:

(1) b is the size of $B_k$, that is the total number of null and non-null elements
in $B_k$;

(2) $b_{i..j}$ is the size of the query range [i..j], that is the number of elements in
the range ($1 \le b_{i..j} \le b$);

(3) $t = M_{count,F}[k]$ is the number of non-null elements in $B_k$ ($1 \le t \le b$);

(4) $s = M_{sum,F}[k]$ is the sum of the elements in $B_k$.
4 Case 1: using the number and the sum of non-null elements separately



In this section we study the estimation of count and sum queries on the basis 
of the sum and count information given for each block (that is, the sum s of 
the elements occurring in each block, and the number t of non null elements 
in it). 

Let us first perform the estimation of the range query count(M[i..j]). Notice
that the random variable representing the answer of a count query depends
on the size $b_{i..j}$ of the range [i..j] involved in the query, rather than on the
position of the range in the block.

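Theorem 1 Let $C_1(b_{i..j}) = count(\tilde{M}[i..j])$ be the integer random variable
ranging from 0 to t defined by extracting $\tilde{M}$ from the datacube population
$M^{-1}_{c,F}$. Then:

(1) the probability distribution $P(C_1(b_{i..j}) = t_{i..j})$ is:

$$P(C_1(b_{i..j}) = t_{i..j}) = \begin{cases} \dfrac{\binom{b_{i..j}}{t_{i..j}} \binom{b - b_{i..j}}{t - t_{i..j}}}{\binom{b}{t}} & \text{if } \max\{0, b_{i..j} - (b - t)\} \le t_{i..j} \le \min\{t, b_{i..j}\} \\ 0 & \text{otherwise} \end{cases}$$

(2) mean and variance are, respectively:

$$E(C_1(b_{i..j})) = (b_{i..j}/b) \cdot t$$

$$\sigma^2(C_1(b_{i..j})) = \frac{t \cdot (b - t) \cdot b_{i..j} \cdot (b - b_{i..j})}{b^2 \cdot (b - 1)}$$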

Proof. It is easy to see that the probability that the number of non-null
elements is $t_{i..j}$ corresponds to the probability of extracting $t_{i..j}$ times the value
1 in $b_{i..j}$ trials, using a binary variable (with values 0, corresponding to a
null value, and 1, corresponding to non-null) in a sample set composed of
b variables, with probability of finding 1 equal to t/b. This case is known
to be characterized by the above expression, that is called hypergeometric
distribution [13]. Thus, mean and variance are those of a random variable
following a hypergeometric distribution. □
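
As a sanity check of Theorem 1, the following Python sketch evaluates the hypergeometric distribution with exact rational arithmetic and compares the mean and variance obtained from their definitions with the closed forms. The function names and the sample values b = 16, t = 7, $b_{i..j}$ = 9 are chosen only for illustration.

```python
from fractions import Fraction
from math import comb

def count_pmf(b, t, b_ij, t_ij):
    """P(C_1(b_ij) = t_ij) for a block of size b with t non-null cells."""
    lo, hi = max(0, b_ij - (b - t)), min(t, b_ij)
    if not lo <= t_ij <= hi:
        return Fraction(0)
    return Fraction(comb(b_ij, t_ij) * comb(b - b_ij, t - t_ij), comb(b, t))

def count_mean(b, t, b_ij):
    """Closed-form mean: (b_ij / b) * t."""
    return Fraction(b_ij * t, b)

def count_variance(b, t, b_ij):
    """Closed-form variance (requires b > 1)."""
    return Fraction(t * (b - t) * b_ij * (b - b_ij), b * b * (b - 1))

b, t, b_ij = 16, 7, 9
pmf = [count_pmf(b, t, b_ij, x) for x in range(t + 1)]
mean = sum(x * p for x, p in enumerate(pmf))
variance = sum(x * x * p for x, p in enumerate(pmf)) - mean ** 2

assert sum(pmf) == 1
assert mean == count_mean(b, t, b_ij)
assert variance == count_variance(b, t, b_ij)
```

Using `Fraction` keeps every comparison exact, so the assertions test the identities themselves rather than a floating-point approximation of them.
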



The diagram in Fig. 2 shows how the variance of $C_1(b_{i..j})$ changes when we
vary t/b for a query of size $b_{i..j} = b/2$ and a block of size b = 1000.
The estimated error is maximum for t = b/2 and behaves symmetrically for
t > b/2 and t < b/2. This result can be explained by observing that, when
t = b/2, the uncertainty in the distribution of null elements is maximum,
since the probability that a fixed element inside the block is null is the same
as the probability that it is not null. The variance is symmetric w.r.t. t = b/2 as the error which
occurs when we estimate count on a block of size b containing t non null
elements is equal to the error of the estimate of the same range query over a
block with the same size b, but containing b - t non null elements.

Fig. 2. $\sigma(C_1(b_{i..j}))$ versus t/b for a block of size b = 1000 and a query of size $b_{i..j} = b/2$

The behavior of $\sigma(C_1(b_{i..j}))$ w.r.t. $b_{i..j}/b$ is analogous to the behavior of $\sigma(C_1(b_{i..j}))$
w.r.t. t/b: The estimated error is maximum for $b_{i..j} = b/2$, and is symmetric for
$b_{i..j} > b/2$ and $b_{i..j} < b/2$. The maximum uncertainty in the estimated result
is reached when the size of the query is half of the size of the whole block.
The estimation becomes more accurate as $b_{i..j}$ gets near to b or to 0: When
$b_{i..j} = b$ the computed answer of the query is exact and is given by t, whereas
if $b_{i..j} = 0$ the returned answer is zero.

The maximum estimation error which may occur when $E(C_1(b_{i..j}))$ is returned
as the answer of the range query $count(b_{i..j})$, denoted by $err^{MAX}_{count}$, is quantified
next.

Proposition 1

$$err^{MAX}_{count} = \max\left\{ \frac{b_{i..j}}{b} \cdot t - \max\{0, t - (b - b_{i..j})\},\ \min\{t, b_{i..j}\} - \frac{b_{i..j}}{b} \cdot t \right\}$$

Proof. The maximum error is obtained when the actual number of non
null elements inside the range of the query is either minimum (i.e. $count(b_{i..j}) = \max\{0, t - (b - b_{i..j})\}$)
or maximum (i.e. $count(b_{i..j}) = \min\{t, b_{i..j}\}$). □
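
The bound of Proposition 1 is simple to evaluate. The sketch below, with the hypothetical function name `count_max_error`, computes it exactly for a block of size b holding t non-nulls and a query range of size $b_{i..j}$.

```python
from fractions import Fraction

def count_max_error(b, t, b_ij):
    """Largest possible distance between the estimate (b_ij / b) * t and
    the true count, which lies between the two extreme values below."""
    estimate = Fraction(b_ij * t, b)
    lowest = max(0, t - (b - b_ij))   # as few non-nulls in the range as possible
    highest = min(t, b_ij)            # as many non-nulls in the range as possible
    return max(estimate - lowest, highest - estimate)

# Block of 16 cells with 7 non-nulls, query range of 9 cells:
# the estimate is 63/16 and the true count lies between 0 and 7.
print(count_max_error(16, 7, 9))   # 63/16
# A range covering the whole block is answered exactly.
print(count_max_error(10, 4, 10))  # 0
```
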

Let us now study the random variable $sum(\tilde{M}[i..j])$ representing the answer
of a sum query on the range [i..j] assuming only the knowledge of the sum
s of the elements occurring in the block $B_k$. Once again, the estimated value
depends on the size of the range involved in the query and not on its actual
position in the block.

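Theorem 2 Let $S_1(b_{i..j}) = sum(\tilde{M}[i..j])$ be the integer random variable ranging
from 0 to s defined by extracting $\tilde{M}$ from the datacube population $M^{-1}_{s,F}$.
Then:

(1) the probability distribution of $S_1(b_{i..j})$ is:

$$P(S_1(b_{i..j}) = s_{i..j}) = \begin{cases} \dfrac{\binom{b_{i..j} + s_{i..j} - 1}{s_{i..j}} \binom{b - b_{i..j} + s - s_{i..j} - 1}{s - s_{i..j}}}{\binom{b + s - 1}{s}} & \text{if } 0 \le s_{i..j} \le s \\ 0 & \text{otherwise} \end{cases}$$

(2) mean and variance are, respectively:

$$E(S_1(b_{i..j})) = (b_{i..j}/b) \cdot s$$

$$\sigma^2(S_1(b_{i..j})) = b_{i..j} \cdot s \cdot \frac{(b - b_{i..j}) \cdot (b + s)}{b^2 \cdot (b + 1)}$$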



Proof. (1) We can see the block $B_k$ as a vector V of b elements which can assume
values between 0 and s. Let $V_{b_{i..j}}$ be the portion of V of size $b_{i..j}$ containing the
elements inside the range of the query, and let $V_{b - b_{i..j}}$ be the remaining part of
V. The random variable $S_1(b_{i..j})$ represents the sum of the elements belonging
to $V_{b_{i..j}}$. The probability that $S_1(b_{i..j})$ assumes the value $s_{i..j}$ can be obtained by
considering all possible value assignments to the elements in $V_{b_{i..j}}$ so that their
sum is $s_{i..j}$, combined with all possible value assignments to the elements of
$V_{b - b_{i..j}}$ so that the total sum is s. The above considered assignments represent
the cases of success. The number of possible cases can be similarly obtained
by considering all possible value assignments to the b elements in V so that
their sum is s.

The number of all the possible assignments from the domain of cardinals to
y elements whose sum is z is equal to the number of multisets with elements
taken from the set {1, ..., y} and having cardinality z: $\binom{y + z - 1}{z}$.
Thus, the number of possible assignments for the elements in the portion
$V_{b_{i..j}}$ is $A = \binom{b_{i..j} + s_{i..j} - 1}{s_{i..j}}$, whereas the assignments for $V_{b - b_{i..j}}$ are
$B = \binom{b - b_{i..j} + s - s_{i..j} - 1}{s - s_{i..j}}$. Analogously, there are $C = \binom{b + s - 1}{s}$ different
assignments of cardinals to the elements in the whole V such that the sum is
s. Hence, the probability that the sum inside a range of size $b_{i..j}$ is $s_{i..j}$ is given
by $(A \cdot B)/C$.

(2) Consider the vectors V, $V_{b_{i..j}}$ and $V_{b - b_{i..j}}$ defined above. The event
$(S_1(b_{i..j}) = s_{i..j})$ is equivalent to the following event: The sum of all the elements in $V_{b_{i..j}}$
is $s_{i..j}$. Let V[i] be a random variable corresponding to the i-th element of V.
From $s = \sum_{1 \le i \le b} V[i]$, we derive $s = \sum_{1 \le i \le b} E(V[i])$ by linearity of the
operator E. The mean of the random variable V[i] is equal to the mean of the
random variable V[j], for any i, j, $1 \le i, j \le b$: By symmetry, the probability
that an element of V assumes a given value is independent of the position of
this element inside the vector. Let us denote by m this mean. From the above
formula for s it follows that $m \cdot b = s$, thus m = s/b. Consider now the vector
$V_{b_{i..j}}$. Let S' be the random variable representing the sum of all the elements
of $V_{b_{i..j}}$. Then $E(S') = b_{i..j} \cdot m$. Hence, $E(S') = b_{i..j} \cdot s/b$.

The variance can be obtained using its definition. The detailed proof is rather
elaborate and, for the sake of presentation, is included in the Appendix as
Claim 1. □
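
The counting argument of the proof lends itself to a direct numerical check. The Python sketch below is only illustrative: the function names and the sample values b = 6, s = 5, $b_{i..j}$ = 2 are assumptions, and the distribution is the one obtained by counting the assignments A, B and C as in the proof.

```python
from fractions import Fraction
from math import comb

def assignments(y, z):
    """Number of ways to assign cardinals to y cells so that they sum to z."""
    if y == 0:
        return 1 if z == 0 else 0
    return comb(y + z - 1, z)

def sum_pmf(b, s, b_ij, s_ij):
    """P(S_1(b_ij) = s_ij) for a block of size b whose elements sum to s."""
    if not 0 <= s_ij <= s:
        return Fraction(0)
    favourable = assignments(b_ij, s_ij) * assignments(b - b_ij, s - s_ij)  # A * B
    return Fraction(favourable, assignments(b, s))                          # / C

def sum_mean(b, s, b_ij):
    """Closed-form mean: (b_ij / b) * s."""
    return Fraction(b_ij * s, b)

def sum_variance(b, s, b_ij):
    """Closed-form variance of S_1(b_ij)."""
    return Fraction(b_ij * s * (b - b_ij) * (b + s), b * b * (b + 1))

b, s, b_ij = 6, 5, 2
pmf = [sum_pmf(b, s, b_ij, x) for x in range(s + 1)]
mean = sum(x * p for x, p in enumerate(pmf))
variance = sum(x * x * p for x, p in enumerate(pmf)) - mean ** 2

assert sum(pmf) == 1
assert mean == sum_mean(b, s, b_ij)
assert variance == sum_variance(b, s, b_ij)
```

The special case in `assignments` for y = 0 covers a query range that spans the whole block, where the remaining part of the vector is empty.
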



The maximum estimation error which may occur while returning $E(S_1(b_{i..j}))$
as the answer of the range query $sum(b_{i..j})$, denoted by $err^{MAX}_{sum}$, is quantified
next.



Proposition 2 

Proof. The maximum error occurs when the elements inside the range of the
query are either all null or all non-null. □




Fig. 3. $\sigma(S_1(b_{i..j}))$ versus $b_{i..j}$ and $s$ for a block of size b = 1000



In the diagrams of Fig. 3 we show how the standard deviation of $S_1(b_{i..j})$
changes, respectively, when we vary $b_{i..j}$ (with s = 1000), and when we vary
$s$ (with $b_{i..j} = b/2$) for a query over a block of size b = 100. The behavior
of the estimated error w.r.t. $b_{i..j}$ is the same as that of $\sigma(C_1(b_{i..j}))$: the
standard deviation is maximum for $b_{i..j} = b/2$ and is symmetric for $b_{i..j} > b/2$
and $b_{i..j} < b/2$. As shown in the diagram on the right-hand side of Fig. 3, the
estimated error increases as the sum of the elements contained in the block
increases: this result is to be expected, as the variance can be thought of as
an estimate of the absolute error.

5 Case 2: using the number and the sum of non-null elements jointly



We now perform the estimation of count and sum queries by exploiting sum
and count aggregate information simultaneously. This amounts to studying
the conjunction of two events: the value of the sum (in a range of size
$b_{i..j}$) is $s_{i..j}$, and the number of non-nulls (in the same range) is $t_{i..j}$. In this
case, count and sum queries are evaluated on datacubes belonging to $M^{-1}_{cs,F}$.
More precisely, we have to study the joint probability distribution of the two
random variables (representing the answer of the count query and the sum
query), in order to derive the two probability distributions. As this case can
be viewed as a specialization of Case 3 (where integrity constraints will also be
exploited to evaluate the estimates, see Section 6), results on this estimation
strategy are formalized in the following corollaries, whose proofs are reported
after the proofs of the corresponding theorems of Case 3.

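Corollary 1 Let $C_2(b_{i..j}) = count(M[i..j])$ and $S_2(b_{i..j}) = sum(M[i..j])$ be two
integer random variables ranging, respectively, from 0 to $t$ and from 0 to $s$,
defined by extracting $M$ from the datacube population $M^{-1}_{cs,F}$. Then the joint
probability distribution $P(C_2(b_{i..j}) = t_{i..j}, S_2(b_{i..j}) = s_{i..j})$ is given by:

$$\begin{cases} \dfrac{Q(b_{i..j}, t_{i..j}, s_{i..j}) \cdot Q(b - b_{i..j}, t - t_{i..j}, s - s_{i..j})}{Q(b, t, s)} & \text{if } 0 \le t - t_{i..j} \le b - b_{i..j} \\ 0 & \text{otherwise} \end{cases}$$

where $Q(x, y, z)$ is equal to:

$$Q(x, y, z) = \begin{cases} 0 & \text{if } (y = 0 \wedge z > 0) \vee (y > 0 \wedge z < y) \vee y > x \\ 1 & \text{if } y = 0 \wedge z = 0 \\ \binom{x}{y} \cdot \binom{z-1}{y-1} & \text{otherwise} \end{cases}$$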
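As an illustration of Corollary 1, the following sketch evaluates the joint distribution from $Q$ and checks that it is normalized. It assumes that $Q(x, y, z) = \binom{x}{y}\binom{z-1}{y-1}$ in the general case (the form obtained from $N$ in Theorem 3 with no localized non-nulls); the function names and the example values are hypothetical.

```python
from fractions import Fraction
from math import comb


def q(x, y, z):
    """Configurations of x cells holding y non-null values (each >= 1) with sum z."""
    if (y == 0 and z > 0) or (y > 0 and z < y) or y > x:
        return 0
    if y == 0 and z == 0:
        return 1
    # choose the non-null positions, then split z into y positive parts
    return comb(x, y) * comb(z - 1, y - 1)


def joint_case2(b, t, s, b_ij, t_ij, s_ij):
    """P(C2 = t_ij, S2 = s_ij) for a range of b_ij cells inside a block of b cells."""
    inside_domain = 0 <= t_ij <= b_ij and 0 <= s_ij <= s
    if not inside_domain or not (0 <= t - t_ij <= b - b_ij):
        return Fraction(0)
    success = q(b_ij, t_ij, s_ij) * q(b - b_ij, t - t_ij, s - s_ij)
    return Fraction(success, q(b, t, s))


# Hypothetical example: b = 8, t = 3, s = 6, range of 5 cells.
total = sum(joint_case2(8, 3, 6, 5, ti, si) for ti in range(4) for si in range(7))
count_marginal_1 = sum(joint_case2(8, 3, 6, 5, 1, si) for si in range(7))
```

In this example the marginal probability of finding exactly one non-null in the range is $\binom{5}{1}\binom{3}{2}/\binom{8}{3}$, which does not involve $s$.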

With the next corollary we formalize a first result about the estimation of the
count query using both count and sum information: that is, the estimation
of the count query cannot exploit the aggregate information about the sum
of the elements in a block. Therein, we derive the probability distribution of
$C_2(b_{i..j})$, its mean and its variance. In particular, we obtain that the probability
distribution of $C_2(b_{i..j})$ coincides with that of $C_1(b_{i..j})$, representing the answer
of the count query when only the knowledge of $t$ is given.

Corollary 2 The probability distribution of the random variable $C_2(b_{i..j})$ defined
in Corollary 1 is: $P(C_2(b_{i..j}) = t_{i..j}) = P(C_1(b_{i..j}) = t_{i..j})$, where $C_1(b_{i..j})$ is
the random variable defined in Theorem 1.

From the corollary above, it follows that the mean and variance of $C_2(b_{i..j})$
are also the same as those of $C_1(b_{i..j})$, and that the maximum estimation error
which may occur while returning $E(C_2(b_{i..j}))$ as the answer of the range query
$count(M[i..j])$ is the same as that of Case 1 (Proposition 1).

Now, we derive mean and variance of the random variable $S_2(b_{i..j})$, representing
the estimated answer of a sum query given the knowledge of $t$ and $s$. Its
probability distribution is given by $P(S_2(b_{i..j}) = s_{i..j}) = \sum_{0 \le t_{i..j} \le t} P(C_2(b_{i..j}) = t_{i..j},
S_2(b_{i..j}) = s_{i..j})$, according to the definition of joint probability distribution.

Corollary 3 Mean and variance of the random variable $S_2(b_{i..j})$ defined in
Corollary 1 are, respectively:

$$E(S_2(b_{i..j})) = \frac{b_{i..j}}{b} \cdot s, \qquad \sigma^2(S_2(b_{i..j})) = \frac{b_{i..j} \cdot (b - b_{i..j}) \cdot s}{b^2 \cdot (b-1) \cdot (t+1)} \cdot [b \cdot (2 \cdot s - t + 1) - s \cdot (t+1)].$$
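The closed forms of Corollary 3 can be cross-checked by brute force on small blocks. The sketch below is an illustration under stated assumptions: it takes the mean as $b_{i..j} \cdot s / b$ and the variance as $\frac{b_{i..j}(b-b_{i..j})s}{b^2(b-1)(t+1)}[b(2s-t+1)-s(t+1)]$, enumerates every block with exactly $t$ non-null cells and total sum $s$, and compares the empirical moments of the sum over the first $b_{i..j}$ cells. All names and example values are hypothetical.

```python
from fractions import Fraction
from itertools import product


def brute_force_sum_moments(b, t, s, b_ij):
    """Mean and variance of the range sum over all blocks with t non-nulls and sum s."""
    sums = [sum(v[:b_ij])
            for v in product(range(s + 1), repeat=b)
            if sum(v) == s and sum(1 for x in v if x > 0) == t]
    n = len(sums)
    mean = Fraction(sum(sums), n)
    var = Fraction(sum(x * x for x in sums), n) - mean ** 2
    return mean, var


def corollary3_moments(b, t, s, b_ij):
    """Closed forms for the mean and variance of S2 (as reconstructed above)."""
    mean = Fraction(b_ij * s, b)
    factor = Fraction(b_ij * (b - b_ij) * s, b * b * (b - 1) * (t + 1))
    var = factor * (b * (2 * s - t + 1) - s * (t + 1))
    return mean, var
```

The enumeration grows as $(s+1)^b$, so it is only meant for tiny blocks such as `b = 5, s = 4`.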

Next we derive the maximum error $err^{MAX}$ produced by estimating the answer
of the range query $sum(M[i..j])$ by means of $E(S_2(b_{i..j}))$.

Proposition 3

$$err^{MAX} = \max\left\{\frac{b_{i..j}}{b} \cdot s - \max\{0, t - (b - b_{i..j})\},\; s - \max\{0, t - b_{i..j}\} - \frac{b_{i..j}}{b} \cdot s\right\}$$

Proof. As non-null elements have a value equal to or greater than 1, the
minimum value of the sum inside $[i..j]$ is given by the minimum number of
non-null elements occurring in this range, that is $\max\{0, t - (b - b_{i..j})\}$. The
maximum value of $sum(b_{i..j})$ is reached when the number of non-null elements outside
the range of the query is minimum and all of them have minimum value
(i.e. 1). As the minimum number of non-null elements outside $[i..j]$ is $\max\{0, t - b_{i..j}\}$, it holds
that the maximum value of $sum(b_{i..j})$ is given by $s - \max\{0, t - b_{i..j}\}$. The
formula expressing the maximum error is obtained by considering the cases
when $sum(b_{i..j})$ is either maximum or minimum. □
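The bound of Proposition 3 is easy to evaluate directly. The following sketch (function name and example values are assumptions made for illustration) computes the estimate $b_{i..j} \cdot s / b$ together with the smallest and largest sums that the range can actually contain, and returns the larger of the two deviations.

```python
from fractions import Fraction


def max_error_sum_case2(b, t, s, b_ij):
    """Worst-case error of the estimate b_ij * s / b when t and s are known."""
    estimate = Fraction(b_ij * s, b)
    # at least t - (b - b_ij) non-nulls must fall inside the range, each worth >= 1
    min_sum = max(0, t - (b - b_ij))
    # at least t - b_ij non-nulls must fall outside the range, each worth >= 1
    max_sum = s - max(0, t - b_ij)
    return max(estimate - min_sum, max_sum - estimate)


# Hypothetical example: half of a block of 100 cells, 60 non-nulls, sum 1000.
worst = max_error_sum_case2(b=100, t=60, s=1000, b_ij=50)
```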

The main consequence of Corollary 2 is that the knowledge of $s$ does not influence
the estimation of the answer of a count query: the probability distribution
of $C_2(b_{i..j})$ coincides with that of $C_1(b_{i..j})$. On the other hand, the knowledge
of the number of null elements in each block changes the estimation of the answer
of a sum query: the probability distribution of $S_2(b_{i..j})$ is different from
that of $S_1(b_{i..j})$. Indeed, the two random variables have the same mean but
different variances. In Fig. 4 we show $\sigma(S_2(b_{i..j}))$ (dashed line) and $\sigma(S_1(b_{i..j}))$
(dotted line) versus $t/b$ for a query of size 50 on a block of size 100 whose
elements have sum 1000.

Fig. 4. $\sigma(S_1(b_{i..j}))$ and $\sigma(S_2(b_{i..j}))$ versus $t/b$ for b = 100 and s = 1000

The standard deviation $\sigma(S_2(b_{i..j}))$ is a decreasing function of $t$: as $t$ gets near
$b$, $\sigma(S_2(b_{i..j}))$ decreases, and reaches a minimum for $t = b$. The influence of $t$
on the value of the estimated error can be strong. For instance, if $t = b/3$ the
value of $\sigma(S_1(b_{i..j}))$ is approximately half of the value of $\sigma(S_2(b_{i..j}))$. The
measure of the error provided by $\sigma(S_2(b_{i..j}))$ is generally greater than that
obtained by means of $S_1(b_{i..j})$, but is more truthful. For instance, if $t$ has a
'small' value (w.r.t. $b$), we have that $\sigma(S_2(b_{i..j})) \gg \sigma(S_1(b_{i..j}))$. In this scenario,
$\sigma(S_2(b_{i..j}))$ provides a better description of the case, since when $t \ll b$ the block
is very sparse, the sum is distributed among few elements and, as there is no
information about the exact position of non-null elements, there is no way to
decide whether the non-null elements are inside or outside the range of the
query.

Note that for $t = b$, $\sigma(S_2(b_{i..j})) < \sigma(S_1(b_{i..j}))$. Indeed, $\sigma(S_1(b_{i..j}))$ is an
increasing function of $s$ and, when $t = b$, evaluating $\sigma(S_2(b_{i..j}))$ is the same as
evaluating $\sigma(S_1(b_{i..j}))$ over a block of the same size (i.e. $b$) whose elements
have sum $s - b$.
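The comparison between the two standard deviations can be reproduced numerically for the setting of Fig. 4 (b = 100, s = 1000, query of size 50). The sketch below rests on two assumptions: $\sigma(S_1)$ is computed from the Case 1 distribution $A \cdot B / C$ (multiset counts), and $\sigma(S_2)$ from the variance $\frac{b_{i..j}(b-b_{i..j})s}{b^2(b-1)(t+1)}[b(2s-t+1)-s(t+1)]$ of Corollary 3. Function names are hypothetical.

```python
from fractions import Fraction
from math import comb, sqrt


def sigma_s1(b, s, b_ij):
    """Standard deviation of S1, obtained from the Case 1 distribution A * B / C."""
    total = comb(b + s - 1, s)
    second = sum(v * v * comb(b_ij + v - 1, v) * comb(b - b_ij + s - v - 1, s - v)
                 for v in range(s + 1))
    mean = Fraction(b_ij * s, b)
    return sqrt(float(Fraction(second, total) - mean ** 2))


def sigma_s2(b, t, s, b_ij):
    """Standard deviation of S2 from the closed-form variance of Corollary 3."""
    factor = Fraction(b_ij * (b - b_ij) * s, b * b * (b - 1) * (t + 1))
    return sqrt(float(factor * (b * (2 * s - t + 1) - s * (t + 1))))


# Setting of Fig. 4; t = 33 approximates t = b/3.
ratio_sparse = sigma_s1(100, 1000, 50) / sigma_s2(100, 33, 1000, 50)
dense_is_smaller = sigma_s2(100, 100, 1000, 50) < sigma_s1(100, 1000, 50)
```

With these inputs `ratio_sparse` comes out slightly below one half, in line with the observation for $t = b/3$, and the inequality for $t = b$ holds as well.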




6 Case 3: using integrity constraints 

In this section we show how the knowledge of both lower bounds and upper
bounds on the number of non-null elements derived by the functions $LB_{=0}$
and $LB_{>0}$ can be exploited in the estimation process. We use the following
additional notations:

(1) $t^u_{i..j} = b_{i..j} - LB_{=0}([i..j])$ and $t^l_{i..j} = LB_{>0}([i..j])$ are respectively an upper
bound and a lower bound on the number of non-null elements in the range
$[i..j]$;

(2) $t^u_{\widetilde{i..j}} = b_{\widetilde{i..j}} - LB_{=0}([\widetilde{i..j}])$ and $t^l_{\widetilde{i..j}} = LB_{>0}([\widetilde{i..j}])$ are respectively an upper
bound and a lower bound on the number of non-null elements in the block
$B_k$ outside the range $[i..j]$ (here $[\widetilde{i..j}]$ denotes the set of elements that are in
$B_k$ but not in the range $[i..j]$);

(3) $t^u = t^u_{i..j} + t^u_{\widetilde{i..j}} = b - LB_{=0}([i..j]) - LB_{=0}([\widetilde{i..j}])$ and $t^l = t^l_{i..j} + t^l_{\widetilde{i..j}} =
LB_{>0}([i..j]) + LB_{>0}([\widetilde{i..j}])$, that is $t^u$ and $t^l$ are respectively an upper
bound and a lower bound on the number of non-null elements in $B_k$.

We define the random variables $count(M[i..j])$ and $sum(M[i..j])$ by extracting
$M$ from the population $M^{-1}_{LB_{=0},LB_{>0}}(M_{cs}, F)$. We point out that, differently from
the previous cases, the random variable representing the answer of a query
also depends on the position of the range $[i..j]$ in the block and not only on
its size. This is because integrity constraints contain information about the
position of null elements in the block, and two distinct ranges of the same size
and belonging to the same block may have different upper bounds and lower
bounds on the number of null and non-null elements.



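Theorem 3 Let $C_3([i..j]) = count(M[i..j])$ and $S_3([i..j]) = sum(M[i..j])$ be
two integer random variables ranging, respectively, from 0 to $t$ and from 0 to
$s$, and defined over the datacube population $M^{-1}_{LB_{=0},LB_{>0}}(M_{cs}, F)$. Then, for each
$t_{i..j}$ and $s_{i..j}$ such that $t^l_{i..j} \le t_{i..j} \le t^u_{i..j}$ and $0 \le s_{i..j} \le s$, the joint probability
distribution $P(C_3([i..j]) = t_{i..j}, S_3([i..j]) = s_{i..j})$ is equal to:

$$\begin{cases} \dfrac{N(t^u_{i..j}, t_{i..j}, s_{i..j}, t^l_{i..j}) \cdot N(t^u_{\widetilde{i..j}}, t_{\widetilde{i..j}}, s_{\widetilde{i..j}}, t^l_{\widetilde{i..j}})}{N(t^u, t, s, t^l)} & \text{if } t^l_{\widetilde{i..j}} \le t_{\widetilde{i..j}} \le t^u_{\widetilde{i..j}} \\ 0 & \text{otherwise} \end{cases}$$

where $t_{\widetilde{i..j}} = t - t_{i..j}$, $s_{\widetilde{i..j}} = s - s_{i..j}$, and

$$N(t_u, t, s, t_l) = \begin{cases} 0 & \text{if } t > t_u \vee t > s \vee (t = 0 \wedge s > 0) \\ 1 & \text{if } t = 0 \wedge s = 0 \\ \binom{t_u - t_l}{t - t_l} \cdot \binom{s-1}{t-1} & \text{otherwise} \end{cases}$$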



Proof. $N(x, y, z, v)$ represents the number of configurations of a vector of size
$x$ containing $y$ non-null elements with sum $z$ such that we know the exact
position of $v$ of them. If $y = 0$ and $z = 0$ there is a unique configuration (all
elements are null), and so $N(x, y, z, v) = 1$. Furthermore, it is not possible
that $y = 0$ and $z > 0$ (if the sum is greater than 0 there must be at least one
non-null element), or that $y > 0$ and $z < y$ (each non-null element has at least
value 1), or that $y > x$ (the number of non-null elements cannot be greater
than the size of the vector): in such cases $N(x, y, z, v) = 0$.

Otherwise, $N(x, y, z, v)$ can be obtained by disposing on $x - v$ positions the
$y - v$ non-null elements of which we do not know the exact position (that can be
accomplished in $\binom{x-v}{y-v}$ different ways) and, for each of these configurations,
by distributing the sum $z$ on $y$ elements. This can generate $\binom{z-1}{y-1}$ different
configurations. The value of $N(x, y, z, v)$ is given by the product of these two
quantities.

The probability distribution does not change if we remove from the block $B_k$
the elements which are certainly null, according to the constraints expressed
by $LB_{=0}$. The block we obtain removing such elements can be seen as a
vector $V$ of size $t^u$, and the query re-formulated over $V$ defines a sub-vector
$V_{t^u_{i..j}}$ of $V$ which has size $t^u_{i..j}$.

In order to evaluate the total number of "successful" configurations for the
entire vector $V$, we have to observe that for each successful configuration for
the portion $V_{t^u_{i..j}}$ we have a number of configurations for the remainder portion
of the vector, say $V_{t^u - t^u_{i..j}}$, which is equal to the number of ways of disposing
$t - t_{i..j}$ non-null elements on $t^u - t^u_{i..j}$ places, having that their sum is $s - s_{i..j}$.
Thus, the cases of success are given by $N(t^u_{i..j}, t_{i..j}, s_{i..j}, t^l_{i..j}) \cdot N(t^u_{\widetilde{i..j}}, t_{\widetilde{i..j}}, s_{\widetilde{i..j}}, t^l_{\widetilde{i..j}})$,
appearing as numerator in the expression of the statement. The denominator
$N(t^u, t, s, t^l)$ can be similarly obtained by considering that the
possible cases are all the configurations of the vector $V$ such that the number
of non-null elements is $t$, the sum is $s$, and satisfying both $LB_{=0}$ and $LB_{>0}$.

□
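The counting function $N$ and the joint distribution of Theorem 3 can be validated against an exhaustive enumeration of a small block. The sketch below is an illustration with hypothetical names; it assumes $N(x, y, z, v) = \binom{x-v}{y-v}\binom{z-1}{y-1}$ in the general case, as described in the proof, and models the integrity constraints as explicit lists of cells known to be null or non-null.

```python
from fractions import Fraction
from itertools import product
from math import comb


def n_config(t_u, t, s, t_l):
    """N(t_u, t, s, t_l): t non-nulls with sum s on t_u candidate cells,
    t_l of which are already localized (assumes t_l <= t)."""
    if t > t_u or t > s or (t == 0 and s > 0):
        return 0
    if t == 0 and s == 0:
        return 1
    return comb(t_u - t_l, t - t_l) * comb(s - 1, t - 1)


def joint_case3(tu_in, tl_in, tu_out, tl_out, t, s, t_ij, s_ij):
    """P(C3 = t_ij, S3 = s_ij) from the bounds inside and outside the range."""
    t_rest, s_rest = t - t_ij, s - s_ij
    if not (tl_in <= t_ij <= tu_in and tl_out <= t_rest <= tu_out and 0 <= s_ij <= s):
        return Fraction(0)
    success = n_config(tu_in, t_ij, s_ij, tl_in) * n_config(tu_out, t_rest, s_rest, tl_out)
    return Fraction(success, n_config(tu_in + tu_out, t, s, tl_in + tl_out))


def brute_force_joint(b, t, s, in_range, known_null, known_nonnull):
    """Joint distribution obtained by enumerating every block consistent with the data."""
    counts = {}
    for v in product(range(s + 1), repeat=b):
        if sum(v) != s or sum(1 for x in v if x > 0) != t:
            continue
        if any(v[i] != 0 for i in known_null) or any(v[i] == 0 for i in known_nonnull):
            continue
        key = (sum(1 for i in in_range if v[i] > 0), sum(v[i] for i in in_range))
        counts[key] = counts.get(key, 0) + 1
    total = sum(counts.values())
    return {k: Fraction(c, total) for k, c in counts.items()}


# Hypothetical block of 5 cells: range = cells 0..2, cell 0 known null, cell 4 known non-null.
empirical = brute_force_joint(5, 2, 4, in_range=[0, 1, 2], known_null=[0], known_nonnull=[4])
```

For this block the bounds are $t^u_{i..j} = 2$, $t^l_{i..j} = 0$ inside the range and $t^u = 2$, $t^l = 1$ outside it, so `joint_case3(2, 0, 2, 1, 2, 4, ...)` should reproduce `empirical` entry by entry.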



Results stated in Theorem 3 can be used to prove Corollary 1. We recall
that Corollary 1 concerns the definition of the probability distribution of the
random variables $C_2(b_{i..j})$ and $S_2(b_{i..j})$ defined in Case 2, where no integrity
constraints were considered.

Proof of Corollary 1 We first observe that Case 2 corresponds to Case
3 with trivial bounds, i.e., $LB_{=0}([i..j]) = LB_{>0}([i..j]) = 0$. Then $t^u_{i..j} = b_{i..j}$,
$t^l_{i..j} = 0$; so the expression for $P(C_3([i..j]) = t_{i..j}, S_3([i..j]) = s_{i..j})$ (see Theorem
3) reduces to the one of $P(C_2(b_{i..j}) = t_{i..j}, S_2(b_{i..j}) = s_{i..j})$. □



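Theorem 4 Let $C_3([i..j])$ be the random variable defined in Theorem 3. Then:

(1) the probability distribution $P(C_3([i..j]) = t_{i..j})$ is:

$$\begin{cases} \dfrac{\binom{t^u_{i..j} - t^l_{i..j}}{t_{i..j} - t^l_{i..j}} \cdot \binom{t^u_{\widetilde{i..j}} - t^l_{\widetilde{i..j}}}{t_{\widetilde{i..j}} - t^l_{\widetilde{i..j}}}}{\binom{t^u - t^l}{t - t^l}} & \text{if } t^l_{\widetilde{i..j}} \le t_{\widetilde{i..j}} \le t^u_{\widetilde{i..j}} \\ 0 & \text{otherwise} \end{cases}$$

(2) Mean and variance of the random variable $C_3([i..j])$ are:

$$E(C_3([i..j])) = \begin{cases} t^l_{i..j} + \dfrac{t^u_{i..j} - t^l_{i..j}}{t^u - t^l} \cdot (t - t^l) & \text{if } t^u > t^l \\ t^l_{i..j} & \text{if } t^u = t^l \end{cases}$$

$$\sigma^2(C_3([i..j])) = \begin{cases} \dfrac{(t^u_{i..j} - t^l_{i..j}) \cdot (t^u_{\widetilde{i..j}} - t^l_{\widetilde{i..j}}) \cdot (t - t^l) \cdot (t^u - t)}{(t^u - t^l)^2 \cdot (t^u - t^l - 1)} & \text{if } t^u > t^l + 1 \\ 0 & \text{otherwise} \end{cases}$$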



Proof. (1) The probability distribution of $C_3([i..j])$ can be obtained by considering
that $P(C_3([i..j]) = t_{i..j}) = \sum_{s_{i..j}=0}^{s} P(C_3([i..j]) = t_{i..j}, S_3([i..j]) = s_{i..j})$
and applying the following equation:

$$\sum_{s_{i..j}=t_{i..j}}^{s-(t-t_{i..j})} \binom{s_{i..j}-1}{t_{i..j}-1} \cdot \binom{s-s_{i..j}-1}{t-t_{i..j}-1} = \binom{s-1}{t-1}$$

which holds as both its left-hand side term and right-hand side term represent
the number of sets containing $t$ cardinals (strictly greater than 0) with sum $s$.

(2: computation of the mean) If $t^u = t^l$ it is the case that all null and non-null
elements are located by integrity constraints. Therefore, $P(C_3([i..j]) = t^l_{i..j}) =
1$. Otherwise, if $t^u > t^l$ we can reason as follows. The block $B_k$ can be viewed as
a vector $V$ of $b$ elements whose values range between 0 and $s$. Let $V_{[i..j]}$ be the
portion of $V$ corresponding to the range $[i..j]$ and let $V_{[\widetilde{i..j}]}$ be the remainder
part of $V$. The event $(C_3([i..j]) = t_{i..j})$ is equivalent to the following event:
the number of non-null elements in $V_{[i..j]}$ is $t_{i..j}$. Let $V[i]$ be a random variable which
assumes the value 1 if the $i$-th element of $V$ is not null, the value 0 otherwise.

From $t = \sum_{1 \le i \le b} V[i]$, we derive $t = t^l + \sum_{1 \le i \le b \,\wedge\, LB_{>0}(i)=0 \,\wedge\, LB_{=0}(i)=0} V[i] =
t^l + \sum_{1 \le i \le b \,\wedge\, LB_{>0}(i)=0 \,\wedge\, LB_{=0}(i)=0} E(V[i])$ by linearity of the operator $E$. The mean of
the random variable $V[i]$ is equal to the mean of the random variable $V[j]$, for
any $i, j$ s.t. $1 \le i, j \le b$ and $LB_{>0}(i) = LB_{>0}(j) = LB_{=0}(i) = LB_{=0}(j) = 0$:
by symmetry, the positions which are not localized by the integrity constraints
have the same probability of containing null or non-null elements. Let $m$ be this
mean. From the above formula for $t$, it follows that $m \cdot (t^u - t^l) = t - t^l$.

Consider now the vector $V_{[i..j]}$. Since $C_3([i..j])$ can be seen as the random
variable representing the number of non-null elements of $V_{[i..j]}$, we have that:
$E(C_3([i..j])) = t^l_{i..j} + (t^u_{i..j} - t^l_{i..j}) \cdot m$. Hence, $E(C_3([i..j])) = t^l_{i..j} + \frac{t^u_{i..j} - t^l_{i..j}}{t^u - t^l} \cdot (t - t^l)$.

(2: computation of the variance) If $t^u = t^l$, as explained for the computation of the mean,
it holds that $P(C_3([i..j]) = t^l_{i..j}) = 1$, and therefore $\sigma^2(C_3([i..j])) = 0$. If $t^u =
t^l + 1$, two cases can occur: 1) $t = t^l$, or 2) $t = t^u$. In the former case,
$P(C_3([i..j]) = t^l_{i..j}) = 1$ holds, whereas in the latter one $P(C_3([i..j]) = t^u_{i..j}) = 1$.
In both cases, we have that $\sigma^2(C_3([i..j])) = 0$.

The formula expressing $\sigma^2(C_3([i..j]))$ for $t^u > t^l + 1$ can be obtained using the
definition of variance. The detailed proof is rather elaborate and, for the sake of
presentation, is reported in the Appendix as Claim 2. □
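A short numerical check may help to see why the mean of $C_3([i..j])$ has this form. Summing the joint distribution of Theorem 3 over $s_{i..j}$ leaves a hypergeometric-type distribution over the positions that are not localized by the constraints; the sketch below assumes this form (it is a reconstruction, since the formula is not legible in the source) and compares its moments with the closed form of the mean and with the standard hypergeometric variance. Names and example values are hypothetical.

```python
from fractions import Fraction
from math import comb


def count_distribution(tu_in, tl_in, tu_out, tl_out, t):
    """P(C3 = t_ij), assuming free non-nulls are spread uniformly over free cells."""
    total = comb(tu_in + tu_out - tl_in - tl_out, t - tl_in - tl_out)
    dist = {}
    for t_ij in range(tl_in, tu_in + 1):
        rest = t - t_ij
        if tl_out <= rest <= tu_out:
            ways = comb(tu_in - tl_in, t_ij - tl_in) * comb(tu_out - tl_out, rest - tl_out)
            dist[t_ij] = Fraction(ways, total)
    return dist


def closed_form_moments(tu_in, tl_in, tu_out, tl_out, t):
    """Mean from Theorem 4 and the hypergeometric variance (valid when t_u > t_l + 1)."""
    tu, tl = tu_in + tu_out, tl_in + tl_out
    mean = tl_in + Fraction((tu_in - tl_in) * (t - tl), tu - tl)
    var = Fraction((tu_in - tl_in) * (tu_out - tl_out) * (t - tl) * (tu - t),
                   (tu - tl) ** 2 * (tu - tl - 1))
    return mean, var


# Hypothetical bounds: 2..6 non-nulls inside the range, 1..9 outside, t = 7.
dist = count_distribution(6, 2, 9, 1, 7)
mean = sum(k * p for k, p in dist.items())
var = sum(k * k * p for k, p in dist.items()) - mean ** 2
```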

Results stated in Theorem 4 can be used to prove Corollary 2, as the random
variable $C_2(b_{i..j})$ can be seen as a special case of $C_3([i..j])$.

Proof of Corollary 2 As shown in the proof of Corollary 1, $t^u_{i..j} = b_{i..j}$, $t^l_{i..j} = 0$.
For the same reasons, $t^u = b$ and $t^l = 0$. By performing these substitutions,
the statement of Theorem 4 reduces to that of Corollary 2. □

Let us now quantify the maximum estimation error $err^{MAX}$ which may occur
while returning $E(C_3([i..j]))$ as the answer of the range query $count([i..j])$.

Proposition 4

$$err^{MAX} = \max\{E(C_3([i..j])) - \max\{t^l_{i..j}, t - (t^u - t^u_{i..j})\},\; \min\{t^u_{i..j}, t - (t^l - t^l_{i..j})\} - E(C_3([i..j]))\}$$

Proof. The minimum number of non-null elements which could be contained in
the range of the query is given by: $\max\{t^l_{i..j}, t - t^u_{\widetilde{i..j}}\} = \max\{t^l_{i..j}, t - (t^u - t^u_{i..j})\}$,
whereas the maximum of $count([i..j])$ is: $\min\{t^u_{i..j}, t - t^l_{\widetilde{i..j}}\} = \min\{t^u_{i..j}, t - (t^l -
t^l_{i..j})\}$. The formula of the maximum error is obtained by considering the cases
where the actual number of non-null elements inside the range of the query is
either minimum or maximum. □

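We now focus our attention on the random variable $S_3([i..j])$, whose mean
and variance are computed in the following theorem. Results stated in this
theorem will be used, in the following, to prove Corollary 3.

Theorem 5 Mean and variance of the random variable $S_3([i..j])$ defined in
Theorem 3 are:

$$E(S_3([i..j])) = \begin{cases} \left[t^l_{i..j} + \dfrac{t^u_{i..j} - t^l_{i..j}}{t^u - t^l} \cdot (t - t^l)\right] \cdot \dfrac{s}{t} & \text{if } t^u > t^l \\ t^l_{i..j} \cdot \dfrac{s}{t} & \text{if } t^u = t^l \end{cases}$$

$$\sigma^2(S_3([i..j])) = \alpha \cdot \left[\sigma^2(C_3([i..j])) + E(C_3([i..j]))^2\right] + \beta \cdot E(C_3([i..j])) - E(S_3([i..j]))^2$$

which, for $t^u > t^l + 1$ and with $h = \frac{t^u_{i..j} - t^l_{i..j}}{t^u - t^l} \cdot (t - t^l)$, expands to:

$$\alpha \cdot h \cdot \left[1 + (t^u_{i..j} - t^l_{i..j} - 1) \cdot \frac{t - t^l - 1}{t^u - t^l - 1}\right] + (\beta + 2 \cdot \alpha \cdot t^l_{i..j}) \cdot h + (\alpha \cdot (t^l_{i..j})^2 + \beta \cdot t^l_{i..j}) - E(S_3([i..j]))^2$$

where:

$$\alpha = \frac{s \cdot (s+1)}{t \cdot (t+1)} \quad \text{and} \quad \beta = \frac{s \cdot (s-t)}{t \cdot (t+1)}$$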

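Proof. Let us first prove the formula expressing $E(S_3([i..j]))$. We assume that
$t^u > t^l$, as the proof for the case $t^u = t^l$ is trivial. Recalling that $\binom{s-1}{t-1} = \binom{s-1}{s-t}$, we have:

$$E(S_3([i..j])) = \sum_{t_{i..j}} \sum_{s_{i..j}} s_{i..j} \cdot P(C_3([i..j]) = t_{i..j}, S_3([i..j]) = s_{i..j}) =$$

$$= \sum_{t_{i..j}} \frac{\binom{t^u_{i..j} - t^l_{i..j}}{t_{i..j} - t^l_{i..j}} \cdot \binom{t^u_{\widetilde{i..j}} - t^l_{\widetilde{i..j}}}{t - t_{i..j} - t^l_{\widetilde{i..j}}}}{\binom{t^u - t^l}{t - t^l} \cdot \binom{s-1}{s-t}} \cdot \sum_{s_{i..j}=t_{i..j}}^{s-(t-t_{i..j})} s_{i..j} \cdot \binom{s_{i..j}-1}{s_{i..j}-t_{i..j}} \cdot \binom{s-s_{i..j}-1}{s-s_{i..j}-(t-t_{i..j})} \qquad (1)$$

where $t_{i..j}$ ranges from $\max\{t^l_{i..j}, t - t^u_{\widetilde{i..j}}\}$ to $\min\{t^u_{i..j}, t - t^l_{\widetilde{i..j}}\}$.

The term:

$$\sum_{s_{i..j}=t_{i..j}}^{s-(t-t_{i..j})} s_{i..j} \cdot \binom{s_{i..j}-1}{s_{i..j}-t_{i..j}} \cdot \binom{s-s_{i..j}-1}{s-s_{i..j}-(t-t_{i..j})}$$

can be re-written, by replacing $s_{i..j}$ with $S_{i..j} + t_{i..j}$, as:

$$\sum_{S_{i..j}=0}^{S} (S_{i..j} + t_{i..j}) \cdot \binom{t_{i..j} + S_{i..j} - 1}{S_{i..j}} \cdot \binom{t - t_{i..j} + S - S_{i..j} - 1}{S - S_{i..j}}$$

where: $S = s - t$.

Since $\binom{x}{y} = \frac{x-y+1}{y} \cdot \binom{x}{y-1}$, it results that:

$$\sum_{S_{i..j}=0}^{S} S_{i..j} \cdot \binom{t_{i..j} + S_{i..j} - 1}{S_{i..j}} \cdot \binom{t - t_{i..j} + S - S_{i..j} - 1}{S - S_{i..j}} = t_{i..j} \cdot \sum_{Q_{i..j}=0}^{S-1} \binom{(t_{i..j}+1) + Q_{i..j} - 1}{Q_{i..j}} \cdot \binom{(t+1) - (t_{i..j}+1) + (S-1) - Q_{i..j} - 1}{(S-1) - Q_{i..j}}$$

where: $Q_{i..j} = S_{i..j} - 1$.

Now observe that the following holds:

$$\sum_{k=0}^{z} \binom{y+k-1}{k} \cdot \binom{x-y+z-k-1}{z-k} = \binom{x+z-1}{z} \qquad (2)$$

since both the above terms represent the number of sets containing $x$ naturals
(including zero) such that their sum is $z$.

Then, by applying formula (2) we obtain:

$$\sum_{S_{i..j}=0}^{S} S_{i..j} \cdot \binom{t_{i..j} + S_{i..j} - 1}{S_{i..j}} \cdot \binom{t - t_{i..j} + S - S_{i..j} - 1}{S - S_{i..j}} = t_{i..j} \cdot \binom{t+S-1}{S-1} = t_{i..j} \cdot \frac{S}{t} \cdot \binom{t+S-1}{S} = t_{i..j} \cdot \frac{s-t}{t} \cdot \binom{s-1}{s-t}$$

and:

$$\sum_{S_{i..j}=0}^{S} \binom{t_{i..j} + S_{i..j} - 1}{S_{i..j}} \cdot \binom{t - t_{i..j} + S - S_{i..j} - 1}{S - S_{i..j}} = \binom{t+S-1}{S} = \binom{s-1}{s-t}$$

By replacing these two terms in (1), we obtain:

$$\sum_{s_{i..j}=t_{i..j}}^{s-(t-t_{i..j})} s_{i..j} \cdot \binom{s_{i..j}-1}{s_{i..j}-t_{i..j}} \cdot \binom{s-s_{i..j}-1}{s-s_{i..j}-(t-t_{i..j})} = t_{i..j} \cdot \frac{s}{t} \cdot \binom{s-1}{s-t}$$

so that:

$$E(S_3([i..j])) = \frac{s}{t} \cdot \sum_{t_{i..j}} t_{i..j} \cdot \frac{\binom{t^u_{i..j} - t^l_{i..j}}{t_{i..j} - t^l_{i..j}} \cdot \binom{t^u_{\widetilde{i..j}} - t^l_{\widetilde{i..j}}}{t - t_{i..j} - t^l_{\widetilde{i..j}}}}{\binom{t^u - t^l}{t - t^l}} \qquad (3)$$

Moreover, it holds that:

$$\sum_{t_{i..j}} t_{i..j} \cdot \binom{t^u_{i..j} - t^l_{i..j}}{t_{i..j} - t^l_{i..j}} \cdot \binom{t^u_{\widetilde{i..j}} - t^l_{\widetilde{i..j}}}{t - t_{i..j} - t^l_{\widetilde{i..j}}} = \sum_{h_{i..j}=0}^{m} (h_{i..j} + t^l_{i..j}) \cdot \binom{n_{i..j}}{h_{i..j}} \cdot \binom{n - n_{i..j}}{m - h_{i..j}}$$

where: $h_{i..j} = t_{i..j} - t^l_{i..j}$, $n_{i..j} = t^u_{i..j} - t^l_{i..j}$, $m = t - t^l$, $n = t^u - t^l$.

As $\binom{x}{y} = \frac{x}{y} \cdot \binom{x-1}{y-1}$, we have that:

$$\sum_{h_{i..j}=1}^{m} h_{i..j} \cdot \binom{n_{i..j}}{h_{i..j}} \cdot \binom{n - n_{i..j}}{m - h_{i..j}} = n_{i..j} \cdot \sum_{p_{i..j}=0}^{m-1} \binom{n_{i..j} - 1}{p_{i..j}} \cdot \binom{n - n_{i..j}}{m - 1 - p_{i..j}}$$

where: $p_{i..j} = h_{i..j} - 1$.

By applying the Vandermonde formula:

$$\sum_{i=0}^{k} \binom{x}{i} \cdot \binom{y}{k-i} = \binom{x+y}{k}$$

we obtain:

$$\sum_{p_{i..j}=0}^{m-1} \binom{n_{i..j} - 1}{p_{i..j}} \cdot \binom{n - n_{i..j}}{m - 1 - p_{i..j}} = \binom{n-1}{m-1} = \frac{m}{n} \cdot \binom{n}{m}$$

and:

$$\sum_{h_{i..j}=0}^{m} \binom{n_{i..j}}{h_{i..j}} \cdot \binom{n - n_{i..j}}{m - h_{i..j}} = \binom{n}{m}$$

After replacing these terms in (3), we obtain:

$$E(S_3([i..j])) = \frac{s}{t} \cdot \left[t^l_{i..j} + n_{i..j} \cdot \frac{m}{n}\right] = \left[t^l_{i..j} + \frac{t^u_{i..j} - t^l_{i..j}}{t^u - t^l} \cdot (t - t^l)\right] \cdot \frac{s}{t}$$

As regards the formula expressing the variance, for the sake of
presentation its proof is deferred to the Appendix as Claim 3. □

Proof of Corollary 3. By applying the same arguments used in the proofs of
Corollaries 1 and 2, it is easy to see that the statement of Theorem 5 reduces
to the statement of Corollary 3. □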
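The mean derived in this proof can be compared with an exhaustive enumeration on a small block. The sketch below is an illustration with hypothetical names: it assumes the closed form $E(S_3([i..j])) = [t^l_{i..j} + \frac{t^u_{i..j} - t^l_{i..j}}{t^u - t^l}(t - t^l)] \cdot \frac{s}{t}$ and represents the integrity constraints as lists of cells known to be null or non-null.

```python
from fractions import Fraction
from itertools import product


def brute_force_mean_sum(b, t, s, in_range, known_null, known_nonnull):
    """Average range sum over every block consistent with t, s and the constraints."""
    total = count = 0
    for v in product(range(s + 1), repeat=b):
        if sum(v) != s or sum(1 for x in v if x > 0) != t:
            continue
        if any(v[i] != 0 for i in known_null) or any(v[i] == 0 for i in known_nonnull):
            continue
        total += sum(v[i] for i in in_range)
        count += 1
    return Fraction(total, count)


def theorem5_mean(tu_in, tl_in, tu_out, tl_out, t, s):
    """Closed-form mean of S3: expected number of non-nulls in the range times s / t."""
    tu, tl = tu_in + tu_out, tl_in + tl_out
    if tu == tl:
        expected_count = Fraction(tl_in)
    else:
        expected_count = tl_in + Fraction((tu_in - tl_in) * (t - tl), tu - tl)
    return expected_count * Fraction(s, t)


# Hypothetical block of 5 cells: range = cells 0..2, cell 0 known to be non-null.
empirical_mean = brute_force_mean_sum(5, 2, 4, in_range=[0, 1, 2],
                                      known_null=[], known_nonnull=[0])
closed_mean = theorem5_mean(tu_in=3, tl_in=1, tu_out=2, tl_out=0, t=2, s=4)
```

Here plain linear interpolation would give $3/5 \cdot 4 = 2.4$, whereas both computations above give 3, because one non-null is known to lie inside the range.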



The maximum estimation error $err^{MAX}$ which may occur while returning
$E(S_3([i..j]))$ as the answer of the range query $sum([i..j])$ is evaluated next:

Proposition 5

$$err^{MAX} = \max\{E(S_3([i..j])) - \max\{t^l_{i..j}, t - (t^u - t^u_{i..j})\},\; s - \max\{t^l - t^l_{i..j}, t - t^u_{i..j}\} - E(S_3([i..j]))\}$$

Proof. The maximum error can be obtained when the actual sum inside the
range of the query is either minimum or maximum. This sum is minimum if
the number of non-null elements inside $[i..j]$ is minimum, and each of these
non-null elements has the minimum value (i.e. 1). Thus, the minimum sum
inside $[i..j]$ coincides with the minimum value of $count([i..j])$ and is given by
$\max\{t^l_{i..j}, t - t^u_{\widetilde{i..j}}\}$, that is: $\max\{t^l_{i..j}, t - t^u + t^u_{i..j}\}$. On the other hand, the value
of $sum([i..j])$ is maximum if the number of non-null elements outside $[i..j]$ is
minimum, and if all the non-null elements in $[\widetilde{i..j}]$ have value 1. Therefore, the
maximum value of the sum inside $[i..j]$ is given by: $s - \max\{t^l_{\widetilde{i..j}}, t - t^u_{i..j}\} =
s - \max\{t^l - t^l_{i..j}, t - t^u_{i..j}\}$. □
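The two extreme sums used in this proof translate directly into code. The sketch below is only illustrative (function name, argument order and example values are assumptions): given an estimate of the sum query, it returns the largest deviation that is compatible with $t$, $s$ and the bounds inside and outside the range.

```python
def max_error_sum_case3(estimate, tu_in, tl_in, tu_out, tl_out, t, s):
    """Worst-case error of `estimate` for sum([i..j]) under the given bounds."""
    # fewest non-nulls that must lie inside the range, each worth at least 1
    min_sum = max(tl_in, t - tu_out)
    # fewest non-nulls that must lie outside the range, each worth at least 1
    max_sum = s - max(tl_out, t - tu_in)
    return max(estimate - min_sum, max_sum - estimate)


# Hypothetical block: 3 range cells (1 known non-null), 2 outside cells, t = 2, s = 4,
# with the estimate 3 obtained from the mean of S3 for these bounds.
worst = max_error_sum_case3(estimate=3, tu_in=3, tl_in=1, tu_out=2, tl_out=0, t=2, s=4)
```

In this example the range sum can be as low as 1 and as high as 4, so the worst-case error of the estimate 3 is 2.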

Remark. Note that, unlike the mean values of $S_1(b_{i..j})$ and $S_2(b_{i..j})$, the value
of $E(S_3([i..j]))$ generally depends on $t$. That is, when the integrity constraints
provided by $LB_{>0}$ are exploited, the estimated answer of a sum query depends
on the number of non-null elements occurring in the whole block $B_k$. This
difference between Case 3 and previous cases can be explained as follows. In
Cases 1 and 2 (when the function $LB_{>0}$ is not available or not exploited),
no information about the exact position of non-null elements inside $B_k$ is
provided. Now, the estimation of the sum query is made by considering all
possible ways of distributing $t$ non-null elements in the block. Thus, if we
partition $B_k$ into two equal halves (by splitting $B_k$ along one of its dimensions),
for each configuration of $B_k$ consisting of $t'$ non-null elements located inside
the first half of $B_k$ and $t - t'$ non-null elements in the other half, there exists
a "symmetric" configuration where $t - t'$ non-null elements are in the first
half of $B_k$ and $t'$ non-null elements are in the second half. This implies that
the knowledge of $t$ alone does not make the distribution of the sum $s$ inside
the block "unbalanced". In contrast, the information encoded in the function
$LB_{>0}$ invalidates the symmetry condition described above. That is, given a
consistent configuration of $B_k$ containing $t'$ non-null elements inside the first
half of $B_k$ and $t - t'$ non-null elements in the other half, the "symmetric"
configuration exists only if it is consistent according to the integrity constraint
expressed by the function $LB_{>0}$.

It should be pointed out that if $LB_{>0}$ is not available or not used, the estimate
provided using Case 3 does not depend on $t$. In fact, when only $LB_{=0}$ is
exploited, the estimation process described in Case 3 works in the same way
as Cases 1 and 2, after removing from $B_k$ all elements which are certainly
null according to $LB_{=0}$. We can reach the same conclusion by extracting a
formula for $E(S_3([i..j]))$ from the one provided in Theorem 5, by substituting
$LB_{>0}([i..j]) = 0$ and $LB_{>0}([\widetilde{i..j}]) = 0$, thus obtaining: $E(S_3([i..j])) = \frac{t^u_{i..j}}{t^u} \cdot s$,
which is independent of $t$.
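The substitution mentioned in this remark can be written out explicitly. The following worked steps are a sketch that assumes the mean of Theorem 5 in the form $E(S_3([i..j])) = [t^l_{i..j} + \frac{t^u_{i..j} - t^l_{i..j}}{t^u - t^l}(t - t^l)] \cdot \frac{s}{t}$; the second line uses, purely as an example, a single localized non-null inside the range.

```latex
% No LB_{>0} information: t^l_{i..j} = t^l = 0, and t cancels out.
E(S_3([i..j])) = \left[0 + \frac{t^u_{i..j}}{t^u} \cdot t\right] \cdot \frac{s}{t}
               = \frac{t^u_{i..j}}{t^u} \cdot s

% One non-null localized inside the range: t^l_{i..j} = t^l = 1, and t remains.
E(S_3([i..j])) = \left[1 + \frac{t^u_{i..j} - 1}{t^u - 1} \cdot (t - 1)\right] \cdot \frac{s}{t}
```

In the second expression the localized element contributes $s/t$ regardless of the range size, which is why the estimate can no longer be expressed without $t$.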



7 Influence of integrity constraints on accuracy: some experimental 
results 

In the analysis of the accuracy of the estimated answers for Cases 1 and
2, we focused our attention on discussing the dependence of variance on the
ratios $b_{i..j}/b$ and $t/b$. The introduction of integrity constraints makes both the
estimated answer and the variance depend on the position of $[i..j]$ inside the
block (since the values of $t^u_{i..j}$ and $t^l_{i..j}$ change as the boundaries of the range
move), and on the maximum number and minimum number of non-nulls
inside the block. Therefore, it is relevant to check how much the variance
changes when we use the knowledge of $LB_{=0}([i..j])$, $LB_{=0}([\widetilde{i..j}])$, $LB_{>0}([i..j])$
and $LB_{>0}([\widetilde{i..j}])$, whose values determine $t^u_{i..j}$, $t^l_{i..j}$, $t^u$ and $t^l$. Next we perform
this analysis but, for the sake of brevity, we shall only consider the presence
of the constraints $LB_{=0}([i..j])$ and $LB_{=0}([\widetilde{i..j}])$, thus assuming $LB_{>0}([i..j]) = 0$
and $LB_{>0}([\widetilde{i..j}]) = 0$; indeed, the dependency of the estimates on the latter
classes of constraints is quantitatively the same.

Consider a sum query of size $b_{i..j} = 500$ over a block with b = 1000 and
t = 500. Fig. 5 shows the standard deviation of the random variable $S_3([i..j])$
versus the value of $LB_{=0}([i..j])$, for different values of $LB_{=0}([\widetilde{i..j}])$: the solid line
corresponds to the value 0 of $LB_{=0}([\widetilde{i..j}])$, the dotted line to the value 10, and
the dash-dot line to the value 20. The diagram shows that, when $LB_{=0}([\widetilde{i..j}]) = 0$
is fixed, $\sigma$ decreases from 84.31 to 70.44, as $LB_{=0}([i..j])$ changes from 0
(which is equivalent to considering no integrity constraint) to 30. This change
corresponds to a variation of 16% of the standard deviation.

Fig. 5. $\sigma(S_3([i..j]))$ versus $LB_{=0}([i..j])$ for different values of $LB_{=0}([\widetilde{i..j}])$

The decrease of the standard deviation depicted in Fig. 5 corresponds to a
"restriction" of the datacube population on which the random variable associated
with the query is defined. In fact, evaluating a query of size $b_{i..j}$ over a block
of size $b$ whose elements have sum $s$ is equivalent to evaluating a query of size
$b_{i..j} - LB_{=0}([i..j])$ over a block containing $b - LB_{=0}([i..j]) - LB_{=0}([\widetilde{i..j}])$ elements
with the same value of $s$. Thus, when $LB_{=0}([i..j]) > 0$ or $LB_{=0}([\widetilde{i..j}]) > 0$, the
population of datacubes which are compatible with the given aggregate data
is restricted w.r.t. the case where both $LB_{=0}([i..j]) = 0$ and $LB_{=0}([\widetilde{i..j}]) = 0$. This
restricted population of datacubes corresponds to a lower "degree of uncertainty"
in distributing the value of $s$ among the elements inside the blocks.

The diagram in Fig. 6 reports $\sigma(S_3([i..j]))$ versus $(LB_{=0}([i..j]) + LB_{=0}([\widetilde{i..j}]))/(b - t)$,
that is the ratio between the number of the null elements localized by integrity
constraints and the total number of null elements of the block (according
to the aggregate data $t$). Fig. 6 shows that the larger the number
$LB_{=0}([i..j]) + LB_{=0}([\widetilde{i..j}])$ of null elements localized by integrity constraints
(compared to the total number of nulls inside the block), the lower is the
value of the estimated error. In Fig. 6 the sum $LB_{=0}([i..j]) + LB_{=0}([\widetilde{i..j}])$ is
denoted as $LB_{=0}$.

Fig. 6. $\sigma(S_3([i..j]))$ versus $LB_{=0}/(b - t)$ for b = 100, t = 50, $b_{i..j}$ = 50 and s = 1000.

The same diagram shows that the estimated error is smaller when $LB_{=0}([i..j])$
and $LB_{=0}([\widetilde{i..j}])$ are "unbalanced", i.e. either $LB_{=0}([i..j]) > LB_{=0}([\widetilde{i..j}])$ or
$LB_{=0}([i..j]) < LB_{=0}([\widetilde{i..j}])$. Intuitively, knowing that most of the
null elements are distributed either inside or outside the range of the query
reduces the approximation in evaluating the distribution of $s$ inside the block.

In sum, as expected, introducing integrity constraints on the number of null
elements in each block influences the estimation process "positively". We stress
that our results are valid for random data samples, so errors may be larger in
real-world applications, whose data distributions can be rather "biased", so that
the estimates evaluated using the framework can be far from
accurate. Next, we present the results of testing our estimation models
on a sample consisting of ten real-life two-dimensional datacubes, which confirm
the positive influence of integrity constraints on the accuracy of estimations.

The datacubes for our experiments contain the daily incomes corresponding 
to the products sold in a chain store during periods of two months belonging 
to ten different years. Each datacube consists of a matrix made of 7580 rows 
(corresponding to all store products) and 60 columns (corresponding to the 
working days). Both count and sum queries over all the ranges of size 100 x 20 
have been evaluated for each datacube, comparing the exact answers to the 
approximate ones. In particular, different compressed representations of every 
datacube have been examined, corresponding to different sizes of the summary 
blocks; for each compressed structure, both the actual and the estimated errors 
obtained with and without the use of integrity constraints have been evaluated. 
For sum queries, the influence of using the parameter t on the query estimation 
result has been studied too. 

In our experiments, the integrity constraints consist of "macro-blocks" which
delimit portions of the cube consisting of all null elements or of all non-null
elements. These macro-blocks do not identify all null [resp., non null] elements
inside the cube, but only those null [resp., non null] elements which are inside
a portion of the cube containing at least 20 null [resp., non null] elements.
Macro-blocks do not overlap, and can be efficiently stored and retrieved using
traditional indexing methods for spatial access. On the average, the adopted
constraints located 40% of null values and 10% of non null elements inside the
examined samples.

In the tables of Figures 7 and 8, results obtained for count and sum queries are
reported. The tables report the intervals which contain the actual error for queries
of size 100 x 20, considering all datacubes. That is, each entry
of the table shows, in percentage terms, the number of estimates whose actual
error is less than $3\sigma$, $4\sigma$, and $5\sigma$, for each of the estimation techniques
proposed in Case 1, Case 2 and Case 3.

58.1% 70.1% 86.2% 94.6% 98.1%

Fig. 7. Number of count queries whose actual error is less than $3\sigma$, $4\sigma$, and $5\sigma$



Results reported in the tables show that:

(1) the use of $t$ makes the estimation of the error for sum queries more accurate;

(2) for both count and sum queries, the accuracy of estimates benefits from
the use of integrity constraints. In particular, a smaller coefficient is needed to
"correct" effectively the estimate provided by $\sigma$, and the value
of this coefficient is almost independent of the particular compressed
representation of the datacube. For instance, without using integrity constraints,
the number of estimated count queries whose actual error is less
than $5\sigma$ is between 71.8% and 93.5%, depending on the block size.
On the other hand, when integrity constraints are used, the number of
estimated count queries whose actual error is less than $5\sigma$ is greater
than 90% for every block size.

Fig. 8. Number of sum queries whose actual error is less than $3\sigma$, $4\sigma$, and $5\sigma$



8 Estimation of Range Queries on Histograms 

In this section we apply our framework to derive some results about mono- 
dimensional histograms. Mono-dimensional histograms are constructed to sum- 
marize the frequency distribution of the values of a single attribute in a 
database relation, and can be exploited to estimate query result sizes [26,34,33]. 
The estimation is accomplished on the basis of the knowledge of both the num- 
ber t of non-null frequencies and the total frequency sum s in each block 
(called bucket in the histogram terminology). As mentioned in the Introduc- 
tion, a crucial point for providing good estimations is the way the frequency 
distributions for original values are partitioned into buckets. Here we assume 
that the buckets have been already arranged using any of the known tech- 
niques, and we therefore focus on the problem of estimating the frequency 
distribution inside a bucket. 

8.1 A theory for the Continuous Value Assumption 

The most common approach to estimate frequency distribution inside a bucket
is the continuous value assumption [35]: the sum of frequencies in a range
of a bucket is estimated by linear interpolation. It corresponds to equally
distributing the overall sum of frequencies of the bucket to all attribute values
occurring in it.

Corollary 3 (where both $t$ and $s$ are used to estimate sum range queries)
provides a theoretical foundation of the continuous value assumption, as it
states that the mean value of the random variable $S_2(b_{i..j})$ is $\frac{b_{i..j}}{b} \cdot s$. Thus our
approach gives a model to explain the linear interpolation and, besides, allows
the error of the estimation to be evaluated by exploiting the knowledge about
the number $t$ of non-nulls in a block, whereas $t$ does not appear in the
computation of the mean.
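To make the connection concrete, the sketch below pairs the linear-interpolation estimate of the continuous value assumption with the error measure that Corollary 3 attaches to it. It is an illustration only: the function names are hypothetical, and the standard deviation assumes the variance $\frac{b_{i..j}(b-b_{i..j})s}{b^2(b-1)(t+1)}[b(2s-t+1)-s(t+1)]$.

```python
from math import sqrt


def cva_estimate(b, s, b_ij):
    """Linear interpolation: the bucket sum s is spread evenly over its b values."""
    return b_ij * s / b


def cva_std_dev(b, t, s, b_ij):
    """Standard deviation of the estimate when the bucket holds t non-null values."""
    variance = (b_ij * (b - b_ij) * s) / (b * b * (b - 1) * (t + 1)) \
        * (b * (2 * s - t + 1) - s * (t + 1))
    return sqrt(variance)


# Hypothetical bucket: 100 attribute values, 50 of them occurring, frequency sum 1000,
# and a range covering a quarter of the bucket.
estimate = cva_estimate(b=100, s=1000, b_ij=25)
spread = cva_std_dev(b=100, t=50, s=1000, b_ij=25)
```

The estimate itself ignores $t$, while `spread` shrinks as $t$ approaches $b$, which is the information the plain interpolation scheme does not report.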

We point out that, in order to provide a more elaborate interpolation scheme,
in [33,34] another method for estimating the sum of frequencies inside a block is
proposed, based on the uniform spread assumption: the $t$ non-null attribute
values in each bucket are assumed to be located at equal distance from each
other, and the overall frequency sum is therefore equally distributed among
them. This method does not give a correct estimation unless we assume that
non-nulls are scattered on the block in some particular, biased way. Next,
using our theoretical framework, we propose an unbiased estimation inside a
block which takes into account the number $t$ of non-null values.



8.2 The 1/2-Biased Assumption 



We first recall that the classical definition of histogram requires that both 
lowest and highest elements (or at least one of them) of any block are not null 
[34] (i.e. they are attribute values occurring in the relation). We call 2-biased 
a block for which both extreme elements are not null; if only the lowest (or the 
highest) element is not null, then the block is called 1-biased. 

So far, linear interpolation has also been used for biased blocks, which 
produces a wrong (one might well say "biased") estimation. We next show 
the correct formulas, which are derived from Theorem 5. 

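Corollary 4 Let $B_k$ be a block of a histogram, and let $S_4([i..j]) = sum(M[i..j])$ 
be an integer random variable ranging from $0$ to $s$, defined by taking $M$ in the 
population of the blocks of size $b$ with sum $s$ and $t$ non-null elements that 
comply with the bias of $B_k$. Then 

(1) if the block $B_k$ is 1-biased and $i$ is the lowest element of the block then 
mean and variance of $S_4([i..j])$ are, respectively: 

$$E(S_4([i..j])) = \frac{s}{t} + (b_{i..j} - 1) \cdot \frac{s}{t} \cdot \frac{t-1}{b-1}$$

$$\sigma^2(S_4([i..j])) = \alpha \cdot (b_{i..j} - 1) \cdot \frac{t-1}{b-1} \cdot \left[ 1 + (b_{i..j} - 2) \cdot \frac{t-2}{b-2} \right] + (\beta + 2 \cdot \alpha) \cdot (b_{i..j} - 1) \cdot \frac{t-1}{b-1} + (\alpha + \beta) - \left( E(S_4([i..j])) \right)^2$$

(2) if the block $B_k$ is 1-biased and $i$ is not the lowest element of the block then 
mean and variance of $S_4([i..j])$ are, respectively: 

$$E(S_4([i..j])) = b_{i..j} \cdot \frac{s}{t} \cdot \frac{t-1}{b-1}$$

$$\sigma^2(S_4([i..j])) = \alpha \cdot b_{i..j} \cdot \frac{t-1}{b-1} \cdot \left[ 1 + (b_{i..j} - 1) \cdot \frac{t-2}{b-2} \right] + \beta \cdot b_{i..j} \cdot \frac{t-1}{b-1} - \left( E(S_4([i..j])) \right)^2$$

(3) if the block $B_k$ is 2-biased and either $i$ or $j$ is an extreme element of the 
block then mean and variance of $S_4([i..j])$ are, respectively: 

$$E(S_4([i..j])) = \frac{s}{t} + (b_{i..j} - 1) \cdot \frac{s}{t} \cdot \frac{t-2}{b-2}$$

$$\sigma^2(S_4([i..j])) = \alpha \cdot (b_{i..j} - 1) \cdot \frac{t-2}{b-2} \cdot \left[ 1 + (b_{i..j} - 2) \cdot \frac{t-3}{b-3} \right] + (\beta + 2 \cdot \alpha) \cdot (b_{i..j} - 1) \cdot \frac{t-2}{b-2} + (\alpha + \beta) - \left( E(S_4([i..j])) \right)^2$$

(4) if the block $B_k$ is 2-biased, and neither $i$ nor $j$ is an extreme element of 
the block, then mean and variance of $S_4([i..j])$ are, respectively: 

$$E(S_4([i..j])) = b_{i..j} \cdot \frac{s}{t} \cdot \frac{t-2}{b-2}$$

$$\sigma^2(S_4([i..j])) = \alpha \cdot b_{i..j} \cdot \frac{t-2}{b-2} \cdot \left[ 1 + (b_{i..j} - 1) \cdot \frac{t-3}{b-3} \right] + \beta \cdot b_{i..j} \cdot \frac{t-2}{b-2} - \left( E(S_4([i..j])) \right)^2$$

where: 

$$\alpha = \frac{s \cdot (s+1)}{t \cdot (t+1)} \quad \text{and:} \quad \beta = \frac{s \cdot (s-t)}{t \cdot (t+1)}$$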



Proof. 

(1) ($B_k$ is 1-biased and $i$ is the lowest element of the block). In this case, 
$E(S_4([i..j]))$ and $\sigma^2(S_4([i..j]))$ coincide with $E(S_3([i..j]))$ and 
$\sigma^2(S_3([i..j]))$, respectively, computed in Theorem 5, by considering 
$LB_{=0}([i..j]) = 0$, $LB_{=0}(\widetilde{i..j}) = 0$, $LB_{>0}([i..j]) = 1$, and 
$LB_{>0}(\widetilde{i..j}) = 0$. The statement of the corollary is thus obtained by 
considering that $t^U = b$, $t^U_{i..j} = b_{i..j}$, $t^L_{i..j} = 1$ and $t^L = 1$. 

(2) ($B_k$ is 1-biased and $i$ is not the lowest element of the block). In this case, 
$E(S_4([i..j]))$ and $\sigma^2(S_4([i..j]))$ coincide with $E(S_3([i..j]))$ and 
$\sigma^2(S_3([i..j]))$, respectively, computed in Theorem 5, by considering 
$LB_{=0}([i..j]) = 0$, $LB_{=0}(\widetilde{i..j}) = 0$, $LB_{>0}([i..j]) = 0$, and 
$LB_{>0}(\widetilde{i..j}) = 1$. The statement of the corollary is thus obtained by 
considering that $t^U = b$, $t^U_{i..j} = b_{i..j}$, $t^L_{i..j} = 0$ and $t^L = 1$. 

(3) ($B_k$ is 2-biased and either $i$ or $j$ is an extreme element of the block). 
Suppose that $i$ is an extreme element of the block (the other case can 
be obtained by symmetry). In this case, $E(S_4([i..j]))$ and $\sigma^2(S_4([i..j]))$ 
coincide with $E(S_3([i..j]))$ and $\sigma^2(S_3([i..j]))$, respectively, computed in 
Theorem 5, in case $LB_{=0}([i..j]) = LB_{=0}(\widetilde{i..j}) = 0$, $LB_{>0}([i..j]) = 1$, 
and $LB_{>0}(\widetilde{i..j}) = 1$. The statement of the corollary is thus obtained by 
considering that $t^U = b$, $t^U_{i..j} = b_{i..j}$, $t^L_{i..j} = 1$ and $t^L = 2$. 

(4) ($B_k$ is 2-biased and neither $i$ nor $j$ is an extreme element of the block). 
In this case, $E(S_4([i..j]))$ and $\sigma^2(S_4([i..j]))$ coincide with $E(S_3([i..j]))$ 
and $\sigma^2(S_3([i..j]))$, respectively, computed in Theorem 5, in case 
$LB_{=0}([i..j]) = LB_{=0}(\widetilde{i..j}) = 0$, $LB_{>0}([i..j]) = 0$, and 
$LB_{>0}(\widetilde{i..j}) = 2$. The statement of the corollary is thus obtained by 
considering that $t^U = b$, $t^U_{i..j} = b_{i..j}$, $t^L_{i..j} = 0$ and $t^L = 2$. 

□ 
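
The four cases of Corollary 4 differ only in the number of non-nulls known a priori in the block and in the range, so they can be evaluated by one routine. The sketch below is an illustration, not the authors' implementation: it assumes the general expression of Theorem 5 specialised with $t^U = b$ and $t^U_{i..j} = b_{i..j}$, it requires $b > t^L + 1$, and the names `biased_block_estimate`, `t_low` (for $t^L$) and `t_low_ij` (for $t^L_{i..j}$) are hypothetical.

```python
from fractions import Fraction


def biased_block_estimate(b, b_ij, s, t, t_low, t_low_ij):
    """Mean and variance of the range sum in a biased block.

    b, b_ij   : size of the block and of the range
    s, t      : sum and number of non-nulls of the block
    t_low     : non-nulls known a priori in the block (1 or 2)
    t_low_ij  : non-nulls known a priori in the range (0 or 1)
    Assumes b > t_low + 1 so that no denominator vanishes.
    """
    n, n_ij, m = b - t_low, b_ij - t_low_ij, t - t_low
    alpha = Fraction(s * (s + 1), t * (t + 1))
    beta = Fraction(s * (s - t), t * (t + 1))
    p1 = Fraction(m, n)          # e.g. (t-1)/(b-1) for a 1-biased block
    p2 = Fraction(m - 1, n - 1)  # e.g. (t-2)/(b-2) for a 1-biased block
    mean = Fraction(s, t) * (t_low_ij + n_ij * p1)
    var = (alpha * n_ij * p1 * (1 + (n_ij - 1) * p2)
           + (beta + 2 * alpha * t_low_ij) * n_ij * p1
           + alpha * t_low_ij ** 2 + beta * t_low_ij
           - mean ** 2)
    return mean, var


# Case (1): 1-biased block of size 4, sum 3, 2 non-nulls,
# range of size 2 starting at the lowest element.
print(biased_block_estimate(4, 2, 3, 2, 1, 1))
# Case (4): 2-biased block of size 5, sum 4, 3 non-nulls,
# range of size 2 containing no extreme element.
print(biased_block_estimate(5, 2, 4, 3, 2, 0))
```

Both examples are small enough to be checked by listing all the compatible blocks by hand: the first has 6 of them, the second has 9.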

The above formulas have been used in [4] to replace the continuous value as- 
sumption inside one of the most efficient methods for histogram representation 
(the maxdiff method [33]), and have produced some meaningful improvements 
in the performance of the method. 



9 Conclusion and Future Work 

In this paper we have defined a probabilistic framework for estimating range 
queries on a compressed datacube obtained by partitioning the original dat- 
acube into a number of non-overlapping blocks and then storing, for each 
block, some aggregate information on its data distribution. The proposed esti- 
mation paradigm allows us to provide an approximate answer of range queries 
(more specifically, sum and count queries) together with an estimate of the 
error of the returned answer, by accessing only the compressed representa- 
tion of the datacube. The estimates of both the answer and the error depend 
on the aggregate data and integrity constraints which are exploited, without 
any a priori assumption on the particular data distribution inside the original 
datacube. We have investigated how the values of the answer and the esti- 
mated error depend on the available aggregate data and integrity constraints, 
by performing both an analytical and experimental study. 

We remark that the idea of introducing integrity constraints is crucial to 
improve the accuracy of estimations in real applications. In fact, the need for 
integrity constraints is due to the fact that real-life datacubes are rather 
"biased" with respect to the "virtual" population of datacubes on which the 
theoretical estimation process is performed. The effectiveness of the estimates 
improves as integrity constraints are introduced because the estimation 
process is accomplished on a restricted population of datacubes, and the examined 
samples are more representative of this population than of the more general one. 

Therefore, further types of constraints are needed in order to catch the ac- 
tual distribution of data inside a datacube and improve the accuracy of the 
estimates. Thus, extensions of this work will follow the directions below: 

• extending the framework by considering further aggregate data on the blocks 
of the datacube, other than the sum and the number of non null values inside 
each block (for instance, the maximum and the minimum value inside the 
blocks); 

• taking into account data skew: this issue can be accomplished by storing 
some information regarding the number of distinct values inside each block, 
or the values with the maximum number of occurrences in the blocks. 
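APPENDIX 

Claim 1 

$$\sigma^2(S_1(b_{i..j})) = b_{i..j} \cdot s \cdot \frac{(b - b_{i..j}) \cdot (b + s)}{b^2 \cdot (b + 1)}$$

Proof. We start from the definition of variance: 

$$\sigma^2(S_1(b_{i..j})) = \sum_{s_{i..j}=0}^{s} s_{i..j}^2 \cdot \frac{\binom{b_{i..j} + s_{i..j} - 1}{s_{i..j}} \cdot \binom{b - b_{i..j} + s - s_{i..j} - 1}{s - s_{i..j}}}{\binom{b + s - 1}{s}} - \left( \frac{b_{i..j}}{b} \cdot s \right)^2 \qquad (5)$$

The term: 

$$\sum_{s_{i..j}=0}^{s} s_{i..j}^2 \cdot \binom{b_{i..j} + s_{i..j} - 1}{s_{i..j}} \cdot \binom{b - b_{i..j} + s - s_{i..j} - 1}{s - s_{i..j}}$$

can be re-written as: 

$$b_{i..j} \cdot \sum_{q_{i..j}=0}^{s-1} (q_{i..j} + 1) \cdot \binom{(b_{i..j} + 1) + q_{i..j} - 1}{q_{i..j}} \cdot \binom{b - b_{i..j} + (s-1) - q_{i..j} - 1}{(s-1) - q_{i..j}}$$

where: $q_{i..j} = s_{i..j} - 1$. 

Then, by applying formula (2), the latter becomes: 

$$b_{i..j} \cdot \left[ \sum_{q_{i..j}=1}^{s-1} q_{i..j} \cdot \binom{(b_{i..j} + 1) + q_{i..j} - 1}{q_{i..j}} \cdot \binom{b - b_{i..j} + (s-1) - q_{i..j} - 1}{(s-1) - q_{i..j}} + \binom{b + s - 1}{s - 1} \right] =$$

$$= b_{i..j} \cdot \left[ (b_{i..j} + 1) \cdot \sum_{a=0}^{s-2} \binom{(b_{i..j} + 2) + a - 1}{a} \cdot \binom{(b + 2) - (b_{i..j} + 2) + (s-2) - a - 1}{(s-2) - a} + \binom{b + s - 1}{s - 1} \right] \quad [\text{where } a = q_{i..j} - 1]$$

$$= b_{i..j} \cdot (b_{i..j} + 1) \cdot \binom{b + s - 1}{s - 2} + b_{i..j} \cdot \binom{b + s - 1}{s - 1} \qquad (7)$$

By substituting (7) in (5) we obtain: 

$$\sigma^2(S_1(b_{i..j})) = b_{i..j} \cdot (b_{i..j} + 1) \cdot \frac{s \cdot (s-1)}{b \cdot (b+1)} + b_{i..j} \cdot \frac{s}{b} - \left( \frac{b_{i..j}}{b} \cdot s \right)^2 =$$

$$= b_{i..j} \cdot s \cdot \frac{b \cdot (b_{i..j} + 1) \cdot (s - 1) + b \cdot (b + 1) - b_{i..j} \cdot s \cdot (b + 1)}{b^2 \cdot (b + 1)} = b_{i..j} \cdot s \cdot \frac{(b - b_{i..j}) \cdot (b + s)}{b^2 \cdot (b + 1)}$$

□ 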
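
The closed form of Claim 1 can be sanity-checked on small instances by enumerating every block of size $b$ whose non-negative integer cells add up to $s$ and computing the variance of the sum of the first $b_{i..j}$ cells directly. The sketch below assumes that all such blocks are equally likely, which is the population the stars-and-bars counts in the proof refer to; the function names are illustrative.

```python
from fractions import Fraction
from itertools import product


def claim1_variance(b, b_ij, s):
    """Closed form stated in Claim 1."""
    return Fraction(b_ij * s * (b - b_ij) * (b + s), b * b * (b + 1))


def brute_force_variance(b, b_ij, s):
    """Variance of the sum of the first b_ij cells over all blocks of
    b non-negative integers summing to s, taken as equally likely."""
    sums = [sum(cells[:b_ij])
            for cells in product(range(s + 1), repeat=b)
            if sum(cells) == s]
    n = len(sums)
    mean = Fraction(sum(sums), n)
    return Fraction(sum(x * x for x in sums), n) - mean ** 2


# Compare the two on a few small instances.
for b, b_ij, s in [(3, 1, 2), (4, 2, 3), (5, 3, 4)]:
    print(b, b_ij, s, claim1_variance(b, b_ij, s), brute_force_variance(b, b_ij, s))
```

The enumeration grows as $(s+1)^b$, so it is only meant for checking the algebra on toy sizes.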
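Claim 2 

$$\sigma^2(C_3([i..j])) = \begin{cases} \dfrac{(t^U_{i..j} - t^L_{i..j}) \cdot (t^U - t^L - t^U_{i..j} + t^L_{i..j}) \cdot (t - t^L) \cdot (t^U - t)}{(t^U - t^L)^2 \cdot (t^U - t^L - 1)} & \text{if } t^U > t^L + 1 \\ 0 & \text{if } t^L \le t^U \le t^L + 1 \end{cases}$$

Proof. We consider the case that $t^U > t^L + 1$. We start from the definition of 
variance: 

$$\sigma^2(C_3([i..j])) = \sum_{t_{i..j}=0}^{t} \left( t_{i..j} - E(C_3([i..j])) \right)^2 \cdot \sum_{s_{i..j}=0}^{s} P(C_3([i..j]) = t_{i..j}, S_3([i..j]) = s_{i..j}) =$$

$$= \sum_{t_{i..j}=0}^{t} t_{i..j}^2 \cdot \sum_{s_{i..j}=0}^{s} P(C_3([i..j]) = t_{i..j}, S_3([i..j]) = s_{i..j}) - \left( E(C_3([i..j])) \right)^2 =$$

$$= \sum_{t_{i..j}=t^L_{i..j}}^{t^U_{i..j}} t_{i..j}^2 \cdot \frac{\binom{t^U_{i..j} - t^L_{i..j}}{t_{i..j} - t^L_{i..j}} \cdot \binom{t^U - t^L - t^U_{i..j} + t^L_{i..j}}{t - t^L - t_{i..j} + t^L_{i..j}}}{\binom{t^U - t^L}{t - t^L}} \cdot \sum_{s_{i..j}=t_{i..j}}^{s - t + t_{i..j}} \frac{\binom{s_{i..j} - 1}{s_{i..j} - t_{i..j}} \cdot \binom{s - s_{i..j} - 1}{s - s_{i..j} - t + t_{i..j}}}{\binom{s - 1}{s - t}} - \left( E(C_3([i..j])) \right)^2$$

By applying the substitutions: $S_{i..j} = s_{i..j} - t_{i..j}$ and: $S = s - t$, the previous 
expression can be rewritten as: 

$$\sum_{t_{i..j}=t^L_{i..j}}^{t^U_{i..j}} t_{i..j}^2 \cdot \frac{\binom{t^U_{i..j} - t^L_{i..j}}{t_{i..j} - t^L_{i..j}} \cdot \binom{t^U - t^L - t^U_{i..j} + t^L_{i..j}}{t - t^L - t_{i..j} + t^L_{i..j}}}{\binom{t^U - t^L}{t - t^L}} \cdot \sum_{S_{i..j}=0}^{S} \frac{\binom{t_{i..j} + S_{i..j} - 1}{S_{i..j}} \cdot \binom{t - t_{i..j} + S - S_{i..j} - 1}{S - S_{i..j}}}{\binom{t + S - 1}{S}} - \left( E(C_3([i..j])) \right)^2 =$$

$$= \sum_{t_{i..j}=t^L_{i..j}}^{t^U_{i..j}} t_{i..j}^2 \cdot \frac{\binom{t^U_{i..j} - t^L_{i..j}}{t_{i..j} - t^L_{i..j}} \cdot \binom{t^U - t^L - t^U_{i..j} + t^L_{i..j}}{t - t^L - t_{i..j} + t^L_{i..j}}}{\binom{t^U - t^L}{t - t^L}} - \left( E(C_3([i..j])) \right)^2 \qquad (8)$$

which holds since, from (2), we have that: 

$$\sum_{S_{i..j}=0}^{S} \frac{\binom{t_{i..j} + S_{i..j} - 1}{S_{i..j}} \cdot \binom{t - t_{i..j} + S - S_{i..j} - 1}{S - S_{i..j}}}{\binom{S + t - 1}{S}} = 1.$$

Let: $h_{i..j} = t_{i..j} - t^L_{i..j}$, $n_{i..j} = t^U_{i..j} - t^L_{i..j}$, $m = t - t^L$, and: $n = t^U - t^L$. 

By applying these substitutions in (8) we obtain: 

$$\sigma^2(C_3([i..j])) = \sum_{h_{i..j}=0}^{n_{i..j}} (h_{i..j} + t^L_{i..j})^2 \cdot \frac{\binom{n_{i..j}}{h_{i..j}} \cdot \binom{n - n_{i..j}}{m - h_{i..j}}}{\binom{n}{m}} - \left( E(C_3([i..j])) \right)^2$$

Finally, by substituting (11), (12) and (13) in the above expression, and since 
$E(C_3([i..j])) = t^L_{i..j} + n_{i..j} \cdot \frac{m}{n}$, we obtain: 

$$\sigma^2(C_3([i..j])) = n_{i..j} \cdot \frac{m}{n} \cdot \left[ 1 + (n_{i..j} - 1) \cdot \frac{m-1}{n-1} \right] + 2 \cdot t^L_{i..j} \cdot n_{i..j} \cdot \frac{m}{n} + (t^L_{i..j})^2 - \left( E(C_3([i..j])) \right)^2 = \frac{n_{i..j} \cdot (n - n_{i..j}) \cdot m \cdot (n - m)}{n^2 \cdot (n - 1)}$$

□ 

Claim 3 

$$\sigma^2(S_3([i..j])) = \begin{cases} \alpha \cdot (t^U_{i..j} - t^L_{i..j}) \cdot \dfrac{t - t^L}{t^U - t^L} \cdot \left[ 1 + (t^U_{i..j} - t^L_{i..j} - 1) \cdot \dfrac{t - t^L - 1}{t^U - t^L - 1} \right] + (\beta + 2 \cdot \alpha \cdot t^L_{i..j}) \cdot (t^U_{i..j} - t^L_{i..j}) \cdot \dfrac{t - t^L}{t^U - t^L} + (\alpha \cdot (t^L_{i..j})^2 + \beta \cdot t^L_{i..j}) - \lambda^2 & \text{if } t^U > t^L + 1 \\ \dfrac{s \cdot (s - t) \cdot t^L_{i..j} \cdot (t - t^L_{i..j})}{t^2 \cdot (t + 1)} & \text{if } t^U = t^L \vee (t^U = t^L + 1 \wedge t = t^L) \\ \dfrac{s \cdot (s - t) \cdot t^U_{i..j} \cdot (t - t^U_{i..j})}{t^2 \cdot (t + 1)} & \text{if } (t^U = t^L + 1 \wedge t = t^U) \end{cases}$$

where: 

$$\alpha = \frac{s \cdot (s+1)}{t \cdot (t+1)}, \quad \beta = \frac{s \cdot (s-t)}{t \cdot (t+1)}, \quad \text{and} \quad \lambda = E(S_3([i..j])).$$

Proof. We consider the case that $t^U > t^L + 1$. We start from the definition of 
variance: 

$$\sigma^2(S_3([i..j])) = \sum_{s_{i..j}=0}^{s} s_{i..j}^2 \cdot \sum_{t_{i..j}=0}^{t} P(C_3([i..j]) = t_{i..j}, S_3([i..j]) = s_{i..j}) - \left( E(S_3([i..j])) \right)^2 =$$

$$= \sum_{t_{i..j}=t^L_{i..j}}^{t^U_{i..j}} \frac{\binom{t^U_{i..j} - t^L_{i..j}}{t_{i..j} - t^L_{i..j}} \cdot \binom{t^U - t^L - t^U_{i..j} + t^L_{i..j}}{t - t^L - t_{i..j} + t^L_{i..j}}}{\binom{t^U - t^L}{t - t^L}} \cdot \frac{1}{\binom{s - 1}{s - t}} \cdot \sum_{s_{i..j}=t_{i..j}}^{s - t + t_{i..j}} s_{i..j}^2 \cdot \binom{s_{i..j} - 1}{s_{i..j} - t_{i..j}} \cdot \binom{s - s_{i..j} - 1}{s - s_{i..j} - t + t_{i..j}} - \left( E(S_3([i..j])) \right)^2 \qquad (9)$$

The term: 

$$\sum_{s_{i..j}=t_{i..j}}^{s - t + t_{i..j}} s_{i..j}^2 \cdot \binom{s_{i..j} - 1}{s_{i..j} - t_{i..j}} \cdot \binom{s - s_{i..j} - 1}{s - s_{i..j} - t + t_{i..j}}$$

can be re-written, by replacing $s_{i..j}$ with $S_{i..j} + t_{i..j}$ and letting $S = s - t$, obtaining: 

$$\sum_{S_{i..j}=0}^{S} (S_{i..j} + t_{i..j})^2 \cdot \binom{t_{i..j} + S_{i..j} - 1}{S_{i..j}} \cdot \binom{t - t_{i..j} + S - S_{i..j} - 1}{S - S_{i..j}}$$

Since: $\binom{x}{y} = \frac{x - y + 1}{y} \cdot \binom{x}{y - 1}$, it results that: 

$$\sum_{S_{i..j}=0}^{S} S_{i..j}^2 \cdot \binom{t_{i..j} + S_{i..j} - 1}{S_{i..j}} \cdot \binom{t - t_{i..j} + S - S_{i..j} - 1}{S - S_{i..j}} = t_{i..j} \cdot \sum_{Q_{i..j}=0}^{S-1} (Q_{i..j} + 1) \cdot \binom{(t_{i..j} + 1) + Q_{i..j} - 1}{Q_{i..j}} \cdot \binom{(t + 1) - (t_{i..j} + 1) + (S - 1) - Q_{i..j} - 1}{(S - 1) - Q_{i..j}}$$

where: $Q_{i..j} = S_{i..j} - 1$. 

By applying formula (2), we obtain that: 

$$\sum_{Q_{i..j}=0}^{S-1} \binom{(t_{i..j} + 1) + Q_{i..j} - 1}{Q_{i..j}} \cdot \binom{(t + 1) - (t_{i..j} + 1) + (S - 1) - Q_{i..j} - 1}{(S - 1) - Q_{i..j}} = \binom{t + S - 1}{S - 1}$$

On the other hand, 

$$\sum_{Q_{i..j}=1}^{S-1} Q_{i..j} \cdot \binom{(t_{i..j} + 1) + Q_{i..j} - 1}{Q_{i..j}} \cdot \binom{(t + 1) - (t_{i..j} + 1) + (S - 1) - Q_{i..j} - 1}{(S - 1) - Q_{i..j}} = \sum_{Q_{i..j}=1}^{S-1} (t_{i..j} + 1) \cdot \binom{t_{i..j} + Q_{i..j}}{Q_{i..j} - 1} \cdot \binom{t - t_{i..j} + (S - 1) - Q_{i..j} - 1}{(S - 1) - Q_{i..j}}$$

Substituting $R_{i..j} = Q_{i..j} - 1$ the latter becomes: 

$$\sum_{R_{i..j}=0}^{S-2} (t_{i..j} + 1) \cdot \binom{(t_{i..j} + 2) + R_{i..j} - 1}{R_{i..j}} \cdot \binom{(t + 2) - (t_{i..j} + 2) + (S - 2) - R_{i..j} - 1}{(S - 2) - R_{i..j}} = (t_{i..j} + 1) \cdot \binom{t + S - 1}{S - 2}$$

Thus we obtain that: 

$$\sum_{S_{i..j}=0}^{S} S_{i..j}^2 \cdot \binom{t_{i..j} + S_{i..j} - 1}{S_{i..j}} \cdot \binom{t - t_{i..j} + S - S_{i..j} - 1}{S - S_{i..j}} = \frac{S}{t} \cdot \binom{t + S - 1}{S} \cdot t_{i..j} \cdot \left[ (t_{i..j} + 1) \cdot \frac{S - 1}{t + 1} + 1 \right]$$

It also holds that: 

$$\sum_{S_{i..j}=0}^{S} 2 \cdot t_{i..j} \cdot S_{i..j} \cdot \binom{t_{i..j} + S_{i..j} - 1}{S_{i..j}} \cdot \binom{t - t_{i..j} + S - S_{i..j} - 1}{S - S_{i..j}} = 2 \cdot t_{i..j}^2 \cdot \sum_{Q_{i..j}=0}^{S-1} \binom{(t_{i..j} + 1) + Q_{i..j} - 1}{Q_{i..j}} \cdot \binom{(t + 1) - (t_{i..j} + 1) + (S - 1) - Q_{i..j} - 1}{(S - 1) - Q_{i..j}} = 2 \cdot t_{i..j}^2 \cdot \frac{S}{t} \cdot \binom{t + S - 1}{S}$$

and, from (2): 

$$\sum_{S_{i..j}=0}^{S} t_{i..j}^2 \cdot \binom{t_{i..j} + S_{i..j} - 1}{S_{i..j}} \cdot \binom{t - t_{i..j} + S - S_{i..j} - 1}{S - S_{i..j}} = t_{i..j}^2 \cdot \binom{t + S - 1}{S}$$

Thus, we have that: 

$$\sum_{S_{i..j}=0}^{S} (S_{i..j} + t_{i..j})^2 \cdot \binom{t_{i..j} + S_{i..j} - 1}{S_{i..j}} \cdot \binom{t - t_{i..j} + S - S_{i..j} - 1}{S - S_{i..j}} = \binom{t + S - 1}{S} \cdot \left[ \frac{(S + t) \cdot (S + t + 1)}{t \cdot (t + 1)} \cdot t_{i..j}^2 + \frac{S \cdot (S + t)}{t \cdot (t + 1)} \cdot t_{i..j} \right] =$$

$$= \binom{t + S - 1}{S} \cdot \left[ \frac{s \cdot (s + 1)}{t \cdot (t + 1)} \cdot t_{i..j}^2 + \frac{s \cdot (s - t)}{t \cdot (t + 1)} \cdot t_{i..j} \right] = \binom{t + S - 1}{S} \cdot \left( \alpha \cdot t_{i..j}^2 + \beta \cdot t_{i..j} \right),$$

where: $\alpha = \frac{s \cdot (s+1)}{t \cdot (t+1)}$ and $\beta = \frac{s \cdot (s-t)}{t \cdot (t+1)}$. 

Substituting this term in (9), and observing that $\binom{t + S - 1}{S} = \binom{s - 1}{s - t}$, we obtain: 

$$\sigma^2(S_3([i..j])) = \frac{1}{\binom{t^U - t^L}{t - t^L}} \cdot \sum_{t_{i..j}=t^L_{i..j}}^{t^U_{i..j}} \binom{t^U_{i..j} - t^L_{i..j}}{t_{i..j} - t^L_{i..j}} \cdot \binom{t^U - t^L - t^U_{i..j} + t^L_{i..j}}{t - t^L - t_{i..j} + t^L_{i..j}} \cdot \left( \alpha \cdot t_{i..j}^2 + \beta \cdot t_{i..j} \right) - \left( E(S_3([i..j])) \right)^2$$

that is: 

$$\sigma^2(S_3([i..j])) = \frac{1}{\binom{n}{m}} \cdot \sum_{h_{i..j}=0}^{n_{i..j}} \binom{n_{i..j}}{h_{i..j}} \cdot \binom{n - n_{i..j}}{m - h_{i..j}} \cdot \left[ \alpha \cdot (h_{i..j} + t^L_{i..j})^2 + \beta \cdot (h_{i..j} + t^L_{i..j}) \right] - \left( E(S_3([i..j])) \right)^2 \qquad (10)$$

where: $h_{i..j} = t_{i..j} - t^L_{i..j}$, $n_{i..j} = t^U_{i..j} - t^L_{i..j}$, $m = t - t^L$, and: $n = t^U - t^L$. 

Since: 

$$\binom{x}{y} = \frac{x}{y} \cdot \binom{x - 1}{y - 1}$$

we have that: 

$$\sum_{h_{i..j}=1}^{n_{i..j}} h_{i..j}^2 \cdot \binom{n_{i..j}}{h_{i..j}} \cdot \binom{n - n_{i..j}}{m - h_{i..j}} = n_{i..j} \cdot \sum_{p_{i..j}=0}^{n_{i..j} - 1} (p_{i..j} + 1) \cdot \binom{n_{i..j} - 1}{p_{i..j}} \cdot \binom{n - n_{i..j}}{m - 1 - p_{i..j}}$$

where: $p_{i..j} = h_{i..j} - 1$. 

By applying the Vandermonde formula (4) we obtain: 

$$\sum_{h_{i..j}=0}^{n_{i..j}} \binom{n_{i..j}}{h_{i..j}} \cdot \binom{n - n_{i..j}}{m - h_{i..j}} = \binom{n}{m} \qquad (11)$$

and: 

$$\sum_{p_{i..j}=0}^{n_{i..j} - 1} \binom{n_{i..j} - 1}{p_{i..j}} \cdot \binom{n - n_{i..j}}{m - 1 - p_{i..j}} = \binom{n - 1}{m - 1} = \frac{m}{n} \cdot \binom{n}{m}$$

On the other hand: 

$$\sum_{p_{i..j}=1}^{n_{i..j} - 1} p_{i..j} \cdot \binom{n_{i..j} - 1}{p_{i..j}} \cdot \binom{n - n_{i..j}}{m - 1 - p_{i..j}} = (n_{i..j} - 1) \cdot \binom{n - 2}{m - 2} = (n_{i..j} - 1) \cdot \frac{m}{n} \cdot \frac{m - 1}{n - 1} \cdot \binom{n}{m}$$

Thus: 

$$\sum_{h_{i..j}=0}^{n_{i..j}} h_{i..j}^2 \cdot \binom{n_{i..j}}{h_{i..j}} \cdot \binom{n - n_{i..j}}{m - h_{i..j}} = n_{i..j} \cdot \frac{m}{n} \cdot \binom{n}{m} + n_{i..j} \cdot (n_{i..j} - 1) \cdot \frac{m}{n} \cdot \frac{m - 1}{n - 1} \cdot \binom{n}{m} =$$

$$= n_{i..j} \cdot \frac{t - t^L}{t^U - t^L} \cdot \binom{t^U - t^L}{t - t^L} + n_{i..j} \cdot (n_{i..j} - 1) \cdot \frac{t - t^L}{t^U - t^L} \cdot \frac{t - t^L - 1}{t^U - t^L - 1} \cdot \binom{t^U - t^L}{t - t^L} \qquad (12)$$

It also holds that: 

$$\sum_{h_{i..j}=0}^{n_{i..j}} h_{i..j} \cdot \binom{n_{i..j}}{h_{i..j}} \cdot \binom{n - n_{i..j}}{m - h_{i..j}} = n_{i..j} \cdot \sum_{r_{i..j}=0}^{n_{i..j} - 1} \binom{n_{i..j} - 1}{r_{i..j}} \cdot \binom{n - n_{i..j}}{m - 1 - r_{i..j}} = n_{i..j} \cdot \frac{m}{n} \cdot \binom{n}{m} = n_{i..j} \cdot \frac{t - t^L}{t^U - t^L} \cdot \binom{t^U - t^L}{t - t^L} \qquad (13)$$

where: $r_{i..j} = h_{i..j} - 1$. 

Finally, substituting (11), (12) and (13) in (10) we obtain: 

$$\sigma^2(S_3([i..j])) = \alpha \cdot (t^U_{i..j} - t^L_{i..j}) \cdot \frac{t - t^L}{t^U - t^L} \cdot \left[ 1 + (t^U_{i..j} - t^L_{i..j} - 1) \cdot \frac{t - t^L - 1}{t^U - t^L - 1} \right] + (\beta + 2 \cdot \alpha \cdot t^L_{i..j}) \cdot (t^U_{i..j} - t^L_{i..j}) \cdot \frac{t - t^L}{t^U - t^L} + (\alpha \cdot (t^L_{i..j})^2 + \beta \cdot t^L_{i..j}) - \left( E(S_3([i..j])) \right)^2$$

□ 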
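
The variances of the count and of the sum over a range can be checked numerically on toy instances. The sketch below makes several assumptions that are not spelled out in the text: every block compatible with the summary is taken as equally likely, no cell is forced to be null (so that $t^U = b$ and $t^U_{i..j} = b_{i..j}$), the cells known to be non-null are given as a list of indices, and the closed form used for the count is the hypergeometric variance with $n = t^U - t^L$, $n_{i..j} = t^U_{i..j} - t^L_{i..j}$ and $m = t - t^L$. All function names are illustrative.

```python
from fractions import Fraction
from itertools import product


def range_stats(b, s, t, forced_nonnull, rng):
    """List (count of non-nulls, sum) over the cells in rng for every block
    of b non-negative integers with sum s, exactly t non-nulls, and a
    non-null value in each cell listed in forced_nonnull."""
    stats = []
    for cells in product(range(s + 1), repeat=b):
        if sum(cells) != s or sum(1 for v in cells if v > 0) != t:
            continue
        if any(cells[k] == 0 for k in forced_nonnull):
            continue
        inside = [cells[k] for k in rng]
        stats.append((sum(1 for v in inside if v > 0), sum(inside)))
    return stats


def variance(xs):
    """Population variance of a list of integers, as an exact fraction."""
    n = len(xs)
    mean = Fraction(sum(xs), n)
    return Fraction(sum(x * x for x in xs), n) - mean ** 2


def count_variance(n, n_ij, m):
    """Hypergeometric variance of the count, valid for n > 1."""
    return Fraction(n_ij * (n - n_ij) * m * (n - m), n * n * (n - 1))


# 2-biased block of size 5 (cells 0 and 4 non-null), sum 4, 3 non-nulls,
# range made of cells 1 and 2: n = 3, n_ij = 2, m = 1.
stats = range_stats(5, 4, 3, [0, 4], [1, 2])
print(variance([c for c, _ in stats]), count_variance(3, 2, 1))
print(variance([x for _, x in stats]))
```

On this instance the enumeration yields 9 blocks; the variance of the range sum it reports can be compared with the value given by case (4) of Corollary 4 for the same parameters.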